Title: Using Graphs in Unstructured and Semistructured Data Mining
1Using Graphs in Unstructured and Semistructured Data Mining
- Soumen Chakrabarti
- IIT Bombay
- www.cse.iitb.ac.in/soumen
2Acknowledgments
- C. Faloutsos, CMU
- W. Cohen, CMU
- IBM Almaden (many colleagues)
- IIT Bombay (many students)
- S. Sarawagi, IIT Bombay
- S. Sudarshan, IIT Bombay
3Graphs are everywhere
- Phone network, Internet, Web
- Databases, XML, email, blogs
- Web of trust (Epinions)
- Text and language artifacts (WordNet)
- Commodity distribution networks
[Figures: protein interactions (genomebiology.com); Internet map (lumeta.com); food web (Martinez 91)]
4Why analyze graphs?
- What properties do real-life graphs have?
- How important is a node? What is importance?
- Who is the best customer to target in a social network?
- Who spread a raging rumor?
- How similar are two nodes?
- How do nodes influence each other?
- Can I predict some property of a node based on
its neighborhood?
5Outline, some more detail
- Part 1 (Modeling graphs)
- What do real-life graphs look like?
- What laws govern their formation, evolution and properties?
- What structural analyses are useful?
- Part 2 (Analyzing graphs)
- Modeling data analysis problems using graphs
- Proposing parametric models
- Estimating parameters
- Applications from Web search and text mining
6Modeling and generating realistic graphs
7Questions
- What do real graphs look like?
- Edges, communities, clustering effects
- What properties of nodes, edges are important to model?
- Degree, paths, cycles, ...
- What local and global properties are important to measure?
- How to artificially generate realistic graphs?
8Modeling: why care?
- Algorithm design: can a skewed degree distribution make our algorithm faster?
- Extrapolation: how well will Pagerank work on the Web 10 years from now?
- Sampling: make sure the scaled-down algorithm shows the same performance/behavior on large-scale data
- Deviation detection: is this page trying to spam the search engine?
9Laws: degree distributions
- Q: the average degree is 10; what is the most probable degree?
[Figure: count vs. degree, with "??" marked at degree 10]
11Power-law: outdegree O
[Figure: frequency vs. outdegree on log-log axes, Nov 97 data; exponent = slope, O = -2.15]
- The plot is linear in log-log scale (FFF99)
- $\text{freq} \propto \text{degree}^{-2.15}$
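A minimal sketch of how such an exponent could be estimated: fit a straight line to log(frequency) against log(degree) by least squares and read off the slope. The function name `fit_power_law_exponent` is hypothetical, and a plain log-log regression is only one (fairly crude) estimator; it is an assumption here, not necessarily the method used for the plot.

```python
import math
from collections import Counter


def fit_power_law_exponent(degrees):
    """Return the least-squares slope of log(frequency) vs. log(degree).

    degrees: iterable of node degrees (zeros are ignored, log(0) is undefined).
    """
    counts = Counter(d for d in degrees if d > 0)
    if len(counts) < 2:
        raise ValueError("need at least two distinct positive degrees")
    points = [(math.log(d), math.log(c)) for d, c in counts.items()]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx
```

On a synthetic degree list whose frequencies follow degree^(-2) exactly, the returned slope is -2 up to rounding; on real data the tail is noisy, so binning or a maximum-likelihood fit may be preferable.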
12Power-law: rank R
[Figure: outdegree vs. rank on log-log axes, Dec 98 data; exponent = slope, R = -0.74]
- Rank nodes in decreasing outdegree order
- The plot is a line in log-log scale
13Eigenvalues
- Let A be the adjacency matrix of graph
- An eigenvalue $\lambda$ of A satisfies
- $A v = \lambda v$, where v is some nonzero vector
- Eigenvalues are strongly related to graph
topology
14Power-law: eigenvalues of E
- Eigenvalues in decreasing order
[Figure: eigenvalue vs. rank of decreasing eigenvalue on log-log axes, Dec 98 data; exponent = slope, E = -0.48]
15The Node Neighborhood
- N(h) = # of pairs of nodes within h hops
- Let average degree = 3
- How many neighbors should I expect within 1, 2, ..., h hops?
- Potential answer:
- 1 hop -> 3 neighbors
- 2 hops -> 3 * 3
- ...
- h hops -> $3^h$
16The Node Neighborhood
- The potential answer ($3^h$ neighbors within h hops) is WRONG: we have duplicates!
17The Node Neighborhood
- The potential answer is WRONG x 2: the average degree is meaningless!
18Power-law: hop-plot H
[Figure: two log-log hop-plots of # of pairs vs. hops; Dec 98 with H = 4.86 and router level 95 with H = 2.83]
- Pairs of nodes as a function of hops: $N(h) \propto h^H$
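The hop-plot can be computed directly with one breadth-first search per node, which also avoids the duplicate counting that breaks the naive estimate. The sketch below assumes an undirected graph given as a dict of neighbor lists and counts ordered pairs of distinct nodes; the name `hop_plot` and these conventions are assumptions for illustration.

```python
from collections import deque


def hop_plot(adj, max_hops):
    """Return a list N where N[h] = # of ordered pairs (u, v), u != v,
    such that v is within h hops of u.

    adj: dict mapping each node to an iterable of its neighbors.
    """
    totals = [0] * (max_hops + 1)
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            if dist[u] == max_hops:
                continue  # do not expand beyond the hop limit
            for v in adj[u]:
                if v not in dist:  # each node is counted once per source
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for v, d in dist.items():
            if v != source:
                for h in range(d, max_hops + 1):
                    totals[h] += 1
    return totals
```

Plotting log N(h) against log h for small h and fitting a line gives an estimate of the hop exponent H. This exact version costs one BFS per node, so on large graphs sampling source nodes is a common shortcut.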
19Observation
- Q: Intuition behind the hop exponent?
- A: the intrinsic (= fractal) dimensionality of the network
- Examples: $N(h) \propto h^1$ and $N(h) \propto h^2$
20Any other laws?
- Bow-tie, for the Web Kumar 99
- IN, SCC, OUT, tendrils
- Disconnected components
21Generators
- How to generate graphs from a realistic distribution?
- Difficulty: simultaneously preserving many local and global properties seen in realistic graphs
- Erdos-Renyi: switch on each edge independently with some probability
- Problem: degree distribution not power-law
- Degree-based
- Process-based (preferential attachment)
22Degree-based
- Fix the degree distribution (e.g., Zipf)
- Assign degrees to nodes
- Add matching edges to satisfy degrees
- No control over other properties
23Process-based: preferential attachment
- Start with a clique with m nodes
- Add one node v at every time step
- v makes m links to old nodes
- Suppose old node u has degree d(u)
- Let $p_u = d(u) / \sum_w d(w)$
- v invokes a multinomial distribution defined by the set of $p_u$'s
- And links to whichever u's show up
- What is the degree distribution?
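The process above can be sketched in a few lines. Sampling uniformly from a list in which node u appears d(u) times picks u with probability $p_u = d(u) / \sum_w d(w)$. Two choices below are assumptions, not part of the slide: duplicate draws are re-drawn so each new node gets m distinct targets, and m is required to be at least 2 so that the starting clique has edges. The name `preferential_attachment` is hypothetical.

```python
import random


def preferential_attachment(n, m, seed=None):
    """Grow a graph on nodes 0..n-1; return its undirected edge list."""
    if m < 2 or n < m:
        raise ValueError("need m >= 2 and n >= m")
    rng = random.Random(seed)
    # Start with a clique on nodes 0..m-1.
    edges = [(u, v) for u in range(m) for v in range(u + 1, m)]
    # Node u appears d(u) times in this list.
    endpoints = [x for edge in edges for x in edge]
    for v in range(m, n):
        targets = set()
        while len(targets) < m:  # assumption: re-draw duplicates
            targets.add(rng.choice(endpoints))
        for u in targets:
            edges.append((u, v))
            endpoints.extend((u, v))
    return edges
```

Feeding the resulting degrees into a log-log frequency plot is a quick way to explore the question on the slide empirically.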
24Bipartite cores
- Problem with preferential attachment: does not explain dense/complete bipartite cores
- 100,000s in O(20 million)-page crawl
- Need a better generation process
[Figure: log(count) vs. log(m) for m x n cores, with series labeled n1, n2, n3; example of a 2 x 3 core]
25Other process-based generators
- Copying model (Kleinberg1999)
- Sample a node v to which we must add k links
- W.p. b add k links to nodes picked u.a.r.
- W.p. 1 - b choose a random reference node r
- Copy k links from r to v
- Much more difficult to analyze
- Reference node -> compression techniques!
- Fabrikant, 02: H.O.T., connect to closest, high-connectivity neighbor
- Pennock, 02: winner does not take all
26R-MAT Graph generator package
- Recursive MATrix generator Chakrabarti,04
- Goals
- Power-law in- and out-degrees
- Power law eigenvalues
- Small diameter
- Few parameters
27Graph Patterns
Count vs edge-stress
28R-MAT
- Subdivide the adjacency matrix
- choose a quadrant with probability (a,b,c,d)
29R-MAT
- Recurse till we reach a 1 x 1 cell
[Figure: adjacency matrix recursively split into quadrants labeled a, b, c, d]
30R-MAT
- By construction:
- rich-get-richer for in-degree
- rich-get-richer for out-degree
- communities within communities
- and small diameter
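The recursive quadrant choice can be sketched as follows: for each edge, descend `scale` levels, choosing a quadrant with probabilities (a, b, c, d) at each level and appending one bit to the row and column index. The function name `rmat_edges`, the default probabilities, and the decision to keep duplicate edges are illustrative assumptions, not settings taken from the R-MAT package.

```python
import random


def rmat_edges(scale, num_edges, a=0.45, b=0.15, c=0.15, d=0.25, seed=None):
    """Drop num_edges directed edges into a 2^scale x 2^scale adjacency matrix.

    Quadrants: a = top-left, b = top-right, c = bottom-left, d = bottom-right.
    Duplicate edges are possible and are kept as-is.
    """
    if abs(a + b + c + d - 1.0) > 1e-9:
        raise ValueError("a + b + c + d must equal 1")
    rng = random.Random(seed)
    edges = []
    for _ in range(num_edges):
        row = col = 0
        for _ in range(scale):  # recurse until a 1 x 1 cell is reached
            r = rng.random()
            row <<= 1
            col <<= 1
            if r < a:
                pass  # top-left: both new bits stay 0
            elif r < a + b:
                col |= 1  # top-right
            elif r < a + b + c:
                row |= 1  # bottom-left
            else:
                row |= 1  # bottom-right
                col |= 1
        edges.append((row, col))
    return edges
```

Making a larger than d concentrates edges on low-numbered nodes, which is where the rich-get-richer effect comes from.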
31Experiments (Clickstream)
[Figure panels: count vs. indegree; count vs. outdegree; hop-plot; singular value vs. rank; left network value; right network value]
- R-MAT matches it well
32Power laws all over
- Bible rank vs. word frequency
- Length of file transfers Bestavros
- Web hit counts Huberman
- Click-stream data Montgomery01
- Lotka's law of publication count (CiteSeer data)
33Resources
- Generators
- RMAT (deepay_at_cs.cmu.edu)
- BRITE www.cs.bu.edu/brite/
- INET topology.eecs.umich.edu/inet
- Visualization tools
- Graphviz www.graphviz.org
- Pajek vlado.fmf.uni-lj.si/pub/networks/pajek
- Kevin Bacon web site www.cs.virginia.edu/oracle
- Erdos numbers etc.
34Outline, some more detail
- Part 1 (Modeling graphs)
- What do real-life graphs look like?
- What laws govern their formation, evolution and properties?
- What structural analyses are useful?
- Part 2 (Analyzing graphs)
- Modeling data analysis problems using graphs
- Proposing parametric models
- Estimating parameters
- Applications from Web search and text mining
35Centrality and prestige
36How important is a node?
- Degree, min-max radius,
- Pagerank
- Maximum entropy network flows
- HITS and stochastic variants
- Stability and susceptibility to spamming
- Hypergraphs and nonlinear systems
- Using other hypertext properties
- Applications: ranking, crawling, clustering, detecting obsolete pages
37Motivating problem
- Given a graph, find its most interesting/central
node
A node is important, if it is connected with
important nodes (recursive, but OK!)
38Motivating problem: PageRank solution
- Given a graph, find its most interesting/central node
- Proposed solution: random walk; spot the most popular node (-> steady-state prob. (ssp))
- A node has high ssp if it is connected with high-ssp nodes (recursive, but OK!)
39(Simplified) PageRank algorithm
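- Let A be the transition matrix (= adjacency matrix); let $A^T$ become column-normalized
- then $A^T p = p$
[Figure: matrix $A^T$ with its axes labeled From and To]
40(Simplified) PageRank algorithm
- $A^T p = p$
[Figure: example graph with nodes 1 to 5]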
41(Simplified) PageRank algorithm
- $A^T p = 1 \cdot p$
- thus, p is the eigenvector that corresponds to the highest eigenvalue (= 1, since the matrix is column-normalized)
42(Simplified) PageRank algorithm
- In short: imagine a particle randomly moving along the edges
- Compute its steady-state probabilities (ssp)
- Full version of the algorithm, with occasional random jumps: see later
43Intuition
- A as vector transformation
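[Figure: the matrix A maps a vector x to a new vector x' = A x]
44Intuition
- By defn., eigenvectors remain parallel to themselves (fixed points)
[Figure: $A v_1 = \lambda_1 v_1$, with $\lambda_1 = 3.62$ in the example]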
45Convergence
- Usually, fast
- depends on the ratio $\lambda_1 : \lambda_2$ of the two largest eigenvalues
46Prestige as Pagerank BrinP1997
- Maxwell's equation for the Web
- PR converges only if E is aperiodic and irreducible; make it so
- d is the (tuned) probability of teleporting to one of N pages uniformly at random
- (Possibly) unintended consequences: topic sensitivity, stability
[Figure: node u with OutDegree(u) = 3 linking to node v]
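The equation on this slide did not survive extraction. A commonly used form consistent with the bullets, with d as the teleport probability, is $p(v) = d/N + (1-d) \sum_{(u,v) \in E} p(u) / \mathrm{OutDegree}(u)$. The sketch below iterates that update until it stops changing. The name `pagerank`, the default d = 0.15, and the treatment of pages without out-links (their mass is spread uniformly) are assumptions made for illustration.

```python
def pagerank(out_links, d=0.15, tol=1e-10, max_iter=1000):
    """Power iteration for Pagerank with teleport probability d.

    out_links: dict mapping every node to a list of its successors
               (every node must appear as a key, possibly with an empty list).
    """
    nodes = list(out_links)
    n = len(nodes)
    p = {u: 1.0 / n for u in nodes}
    for _ in range(max_iter):
        nxt = {u: d / n for u in nodes}  # teleport share
        dangling = 0.0
        for u in nodes:
            succ = out_links[u]
            if succ:
                share = (1.0 - d) * p[u] / len(succ)
                for v in succ:
                    nxt[v] += share
            else:
                dangling += (1.0 - d) * p[u]  # assumption: spread uniformly
        for u in nodes:
            nxt[u] += dangling / n
        delta = sum(abs(nxt[u] - p[u]) for u in nodes)
        p = nxt
        if delta < tol:
            break
    return p
```

Teleporting makes the walk aperiodic and irreducible, which is the "make it so" step; replacing the uniform teleport share with a share restricted to a chosen page set gives a biased variant of the same iteration.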
47Prestige as network flow
- $y_{ij}$ = # of surfers clicking from i to j per unit time
- Hits per unit time on page j is
- Flow is conserved at
- The total traffic is
- Normalize: can interpret $p_{ij}$ as a probability
- Standard Pagerank corresponds to one solution
- Many other solutions possible
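The formulas on this slide were lost in extraction. One plausible reconstruction that matches the bullet text is given below; the symbols $H_j$ (hits on page j) and $Y$ (total traffic) are assumed names, chosen to agree with the $H_i$ that appears in the later results slide.

```latex
\begin{align*}
H_j &= \sum_i y_{ij}
  && \text{(hits per unit time on page } j\text{)} \\
\sum_i y_{ij} &= \sum_k y_{jk}
  && \text{(flow is conserved at every page } j\text{)} \\
Y &= \sum_{i,j} y_{ij}
  && \text{(total traffic)} \\
p_{ij} &= y_{ij} / Y
  && \text{(normalized flow, so } \textstyle\sum_{i,j} p_{ij} = 1\text{)}
\end{align*}
```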
48Maximum entropy flow Tomlin2003
- Flow conservation modeled using feature
- And the constraints
- Goal is to maximize subject to
- Solution has form
- ?i is the hotness of page i
49Maxent flow results
- ?i ranking is better than Pagerank; Hi ranking is worse
- Two IBM intranet data sets with known top URLs
- Depth up to which dmoz.org URLs are used as ground truth
[Figure: average rank (10^6 and 10^8 scales) of known top URLs when sorted by Pagerank, Hi, and ?i; smaller rank is better]
50Nodes influencing neighbors
- Q1: How does a virus spread across an arbitrary network?
- Q2: Will it create an epidemic?
- Susceptible-Infected-Susceptible (SIS) model
- Cured nodes immediately become susceptible
51The model
- (virus) Birth rate b: probability that an infected neighbor attacks
- (virus) Death rate d: probability that an infected node heals
[Figure: node N with neighbors N1, N2, N3; nodes are either healthy or infected]
53Epidemic threshold $\tau$
- of a graph: defined as the value of $\tau$ such that
- if strength s = b/d < $\tau$
- an epidemic cannot happen
- Thus,
- given a graph
- compute its epidemic threshold
54Epidemic threshold $\tau$
- What should $\tau$ depend on?
- avg. degree? and/or highest degree?
- and/or variance of degree?
- and/or third moment of degree?
55Epidemic threshold
- Theorem: we have no epidemic if
- $\beta/\delta < \tau = 1/\lambda_{1,A}$
56Epidemic threshold
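- Theorem: we have no epidemic if
- $\beta/\delta < \tau = 1/\lambda_{1,A}$
- $\beta$: attack prob. (birth rate b); $\delta$: recovery prob. (death rate d)
- $\tau$: epidemic threshold; $\lambda_{1,A}$: largest eigenvalue of adj. matrix A
- Proof: Wang03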
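A minimal sketch of computing the threshold for a given graph: estimate the largest eigenvalue of the adjacency matrix by power iteration and take its reciprocal. The sketch assumes an undirected, connected graph with at least one edge, stored as a dict of neighbor lists. It iterates with A + I rather than A, a standard shift that avoids oscillation on bipartite graphs. All function names are hypothetical.

```python
def largest_eigenvalue(adj, max_iter=10000, tol=1e-12):
    """Largest adjacency eigenvalue of an undirected, connected graph."""
    x = {u: 1.0 for u in adj}
    lam = 0.0
    for _ in range(max_iter):
        # Multiply by (A + I); its top eigenvalue is lambda_1 + 1.
        y = {u: x[u] + sum(x[v] for v in adj[u]) for u in adj}
        scale = max(y.values())
        x = {u: val / scale for u, val in y.items()}
        new_lam = scale - 1.0
        converged = abs(new_lam - lam) < tol
        lam = new_lam
        if converged:
            break
    return lam


def epidemic_threshold(adj):
    """tau = 1 / lambda_1 of the adjacency matrix."""
    return 1.0 / largest_eigenvalue(adj)


def no_epidemic(adj, birth_rate, death_rate):
    """True if the strength b/d is below the threshold."""
    return birth_rate / death_rate < epidemic_threshold(adj)
```

For a complete graph on 4 nodes the largest eigenvalue is 3, so the threshold is 1/3; for a star with 4 leaves it is 2, giving a threshold of 1/2.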
57Experiments (Oregon)
[Figure: three curves, for b/d > $\tau$ (above threshold), b/d = $\tau$ (at the threshold), and b/d < $\tau$ (below threshold)]
58HITS Kleinberg1997
- Two kinds of prestige
- Good hubs link to good authorities
- Good authorities are linked to by good hubs
- Eigensystems of $E E^T$ (h) and $E^T E$ (a)
- Whereas Pagerank uses the eigensystem of where
- Query-specific graph; drop same-site links
59Kleinberg's algorithm
- Step 1: expand by one move forward and backward
60Kleinberg's algorithm
- Give a high score (= authorities) to nodes that many important nodes point to
- Give a high importance score (= hubs) to nodes that point to good authorities
[Figure: hubs on one side pointing to authorities on the other]
61Kleinberg's algorithm
- Observations
- recursive definition!
- each node (say, the i-th node) has both an authoritativeness score $a_i$ and a hubness score $h_i$
62Kleinberg's algorithm
- Let A be the adjacency matrix
- the (i,j) entry is 1 if the edge from i to j exists
- Let h and a be n x 1 vectors with the hubness and authoritativeness scores
- Then $h = A a$ and $a = A^T h$
63Kleinberg's algorithm
- In conclusion, we want vectors h and a such that
- $h = A a$
- $a = A^T h$
- That is
- $a = A^T A a$
64Kleinberg's algorithm
- a is a right-singular vector of the adjacency matrix A (by dfn!)
- = eigenvector of $A^T A$
- Starting from random a and iterating, we'll eventually converge
- (Q: to which of all the eigenvectors? why?)
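The iteration $a = A^T h$, $h = A a$ can be sketched directly on an edge list. Two choices below are assumptions: the scores start from all ones instead of a random vector (for reproducibility), and both vectors are rescaled to unit length after every round. The function name `hits` is hypothetical.

```python
import math


def hits(edges, iters=100):
    """Return (hub, authority) score dicts for a directed edge list."""
    nodes = sorted({x for edge in edges for x in edge})
    h = {u: 1.0 for u in nodes}
    a = {u: 1.0 for u in nodes}
    for _ in range(iters):
        # a = A^T h: a node's authority is the sum of hub scores pointing to it.
        a = {u: 0.0 for u in nodes}
        for i, j in edges:
            a[j] += h[i]
        # h = A a: a node's hubness is the sum of authority scores it points to.
        h = {u: 0.0 for u in nodes}
        for i, j in edges:
            h[i] += a[j]
        for vec in (a, h):  # rescale to unit Euclidean length
            norm = math.sqrt(sum(val * val for val in vec.values()))
            if norm > 0.0:
                for u in vec:
                    vec[u] /= norm
    return h, a
```

This is power iteration on $A^T A$, so when the largest eigenvalue is strictly dominant and the start vector is not orthogonal to its eigenvector, a converges to the principal eigenvector, which is one way to answer the question on the slide.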
65Dyadic interpretation CohnC2000
- Graph includes many communities z
- Query "Jaguar" gets auto, game, animal links
- Each URL is represented as two things
- A document d
- A citation c
- Max
- Guess number of aspects z and use Hofmann 1999 to estimate Pr(c|z)
- These are the most authoritative URLs
66Dyadic results for Machine learning
Clustering based on citations ranking within
clusters
67Spamming link-based ranking
- Recipe for spamming HITS
- Create a hub linking to genuine authorities
- Then mix in links to your customers' sites
- Highly susceptible to adversarial behavior
- Recipe for spamming Pagerank
- Buy a bunch of domains, cloak IP addresses
- Host a site at each domain
- Sprinkle a few links at random per page to other sites you own
- Takes more work than spamming HITS
68Stability of link analysis NgZJ2001
- Compute HITS authority scores and Pagerank
- Delete 30% of nodes/links at random
- Recompute and compare ranks; repeat
- Pagerank ranks more stable than HITS authority ranks
- Why?
- How to design more stable algorithms?
[Figure panels: HITS authority; Pagerank]
69Stability depends on graph and params
- Auth score is eigenvector for $E^T E = S$, say
- Let $\lambda_1 > \lambda_2$ be the first two eigenvalues
- There exists an $\tilde{S}$ such that
- S and $\tilde{S}$ are close: $\|S - \tilde{S}\|_F = O(\lambda_1 - \lambda_2)$
- But $\|u_1 - \tilde{u}_1\|_2 = \Omega(1)$
- Pagerank p is eigenvector of
- U is a matrix full of 1/N and $\epsilon$ is the jump prob
- If set C of nodes are changed in any way, the new Pagerank vector $\tilde{p}$ satisfies
70Randomized HITS
- Each half-step, with probability $\epsilon$, teleport to a node chosen uniformly at random
- Much more stable than HITS
- Results more meaningful
- $\epsilon$ near 1 will always stabilize
- Here $\epsilon$ was 0.2
[Figure panels: Randomized HITS; Pagerank]
71Another random walk variation of HITS
- SALSA Stochastic HITS Lempel2000
- Two separate random walks
- From authority to authority via hub
- From hub to hub via authority
- Transition probability $\Pr(a_i \to a_j)$
- If transition graph is irreducible,
- For disconnected components, depends on relative size of bipartite cores
- Avoids dominance of larger cores
72SALSA sample result (movies)
HITS The Tightly-Knit Community (TKC) effect
SALSA Less TKC influence (but no reinforcement!)
73Links in relational data GibsonKR1998
- (Attribute, value) pair is a node
- Each node v has weight $w_v$
- Each tuple is a hyperedge
- Tuple r has weight $x_r$
- HITS-like iterations to update weight $w_v$
- For each tuple
- Update weight
- Combining operator $\oplus$ can be sum, max, product, $L_p$ avg, etc.
74Distilling links in relational data
Theory
Database
Author
Author
Forum
Year
75Searching and annotating graph data
76Searching graph data
- Nodes in graph contain text
- Random -> intelligent surfer RichardsonD2001
- Topic-sensitive Pagerank Haveliwala2002
- Assigning image captions using random walks PanYFD2004
- Rotting pages and links BarYossefBKT2004
- Query is a set of keywords
- All keywords may not match a single node
- Implicit joins Hulgeri2001, Agrawal2002
- Or rank aggregation Balmin2004 required
77Intelligent Web surfer
- Keyword
- Probability of teleporting to node j
- Probability of walking from i to j wrt q
- Relevance of node k wrt q
- Pick out-link to walk on in proportion to relevance of target out-neighbor
- Query = set of words
- Pick a query word per some distribution, e.g. IDF
78Implementing the intelligent surfer
- $PR_Q(j)$ approximates a walk that picks a query keyword using Pr(q) at every step
- Precompute and store $PR_q(j)$ for each keyword q in lexicon: space blowup = avg doc length
- Query-dependent PR rated better by volunteers
79Topic-sensitive Pagerank
- High overhead for per-word Pagerank
- Instead, compute Pageranks for some collection of broad topics: $PR_c(j)$
- Topic c has sample page set $S_c$
- Walk as in Pagerank
- Jump to a node in $S_c$ uniformly at random
- Project query onto set of topics
- Rank responses by projection-weighted Pageranks
80Topic-sensitive Pagerank results
- Users prefer topic-sensitive Pagerank on most queries to global Pagerank + keyword filter
81Image captioning
- Segment images into regions
- Image has caption words
- Three-layer graph image, regions, caption words
- Threshold on region similarity to connect regions
(dotted)
82Random walks with restarts
[Figure: three-layer graph of images, regions, and words, with the test image attached]
- Find regions in test image
- Connect regions to other nodes in the region layer using region similarity
- Random walk, restarting at test image node
- Pick words with largest visit probability
83More random walks Link rot
- How stale is a page?
- Last-mod unreliable
- Automatic dead-link cleaners mask disuse
- A page is completely stale if it is dead
- Let D be the set of pages which cannot be accessed (404 and other problems)
- How stale is a page u? Start with p <- u
- If $p \in D$, declare decay value of u to be 1, else
- With probability $\sigma$ declare decay value of u = 0
- W.p. 1 - $\sigma$ choose outlink v, set p <- v, loop
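The walk above can be simulated directly, and averaging the declared values over many walks gives an estimate of the decay score of u. In this sketch the stopping probability is called `sigma`; pages with no out-links are treated as ending the walk with value 0, and out-links are chosen uniformly at random. These conventions and the name `decay_score` are assumptions, since the slide does not specify them.

```python
import random


def decay_score(u, out_links, dead, sigma=0.1, trials=10000, seed=None):
    """Monte Carlo estimate of the decay value of page u.

    out_links: dict mapping a page to a list of pages it links to.
    dead:      set D of pages that cannot be accessed.
    sigma:     probability of stopping (with value 0) at each live page.
    """
    if not 0.0 < sigma <= 1.0:
        raise ValueError("sigma must be in (0, 1] so that every walk ends")
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        p = u
        while True:
            if p in dead:
                total += 1  # walk reached a dead page: value 1
                break
            if rng.random() < sigma:
                break  # stop here: value 0
            succ = out_links.get(p, [])
            if not succ:
                break  # assumption: no out-links also ends with value 0
            p = rng.choice(succ)
    return total / trials
```

A page whose only out-link is dead gets a score close to 1 - sigma, while a page that links only to live pages that themselves link to dead ones still gets a positive score.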
84Page staleness results
[Figure: decay score vs. fraction of 404s]
- Decay score is correlated with, but generally larger than, the fraction of dead outlinks on a page
- Removing direct dead links automatically does not eliminate live but rotting pages
85Graph proximity search two paradigms
- A single node as query response
- Find node that matches query terms
- or is near nodes matching query terms
- Goldman 1998
- A connected subgraph as query response
- Single node may not match all keywords
- No natural page boundary Bhalotia2002
Agrawal2002
86Single-node response examples
- Travolta, Cage -> Actor, Face/Off
- Travolta, Cage, Movie -> Face/Off
- Kleiser, Movie -> Gathering, Grease
- Kleiser, Woo, Actor -> Travolta
[Figure: data graph with nodes Movie, Face/Off, Grease, Gathering, Travolta, Cage, A3, Kleiser, Woo, Actor, Director and edge labels is-a, acted-in, directed]
87Basic search strategy
- Node subset A activated because they match query keyword(s)
- Look for node near nodes that are activated
- Goodness of response node depends
- Directly on degree of activation
- Inversely on distance from activated node(s)
88Proximity query screenshot
http://www.cse.iitb.ac.in/banks/
89Ranking a single node response
- Activated node set A
- Rank node r in response set R based on proximity to nodes a in A
- Nodes have relevance scores, one for R and one for A, in [0, 1]
- Edge costs are specified by the system
- d(a,r) = cost of shortest path from a to r
- Bond between a and r
- Parameter t tunes relative emphasis on distance and relevance score
- Several ad-hoc choices
90Scoring single response nodes
- Additive
- Belief
- Goal: list a limited number of find nodes with the largest scores
- Performance issues
- Assume the graph is in memory?
- Precompute all-pairs shortest path ($O(|V|^3)$)?
- Prune unpromising candidates?
91Hub indexing
- Decompose APSP problem using sparse vertex cuts
- |A| + |B| shortest paths to p
- |A| + |B| shortest paths to q
- d(p,q)
- To find d(a,b) compare
- d(a -> p -> b) not through q
- d(a -> q -> b) not through p
- d(a -> p -> q -> b)
- d(a -> q -> p -> b)
- Greatest savings when $|A| \approx |B|$
- Heuristics to find cuts, e.g. large-degree nodes
[Figure: node sets A and B separated by cut vertices p and q, with a in A and b in B]
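The comparison of candidate routes can be sketched as a lookup over stored hub distances. The sketch assumes an undirected graph, that a and b lie on opposite sides of the cut, and that two tables have been precomputed: `to_hub[x][h]`, the cost of the shortest path from x to hub h that avoids the other hubs, and `hub_to_hub[h][k]`, the distance between hubs. The table layout and the name `distance_via_hubs` are hypothetical.

```python
import math


def distance_via_hubs(a, b, to_hub, hub_to_hub):
    """Shortest a-b distance when every a-b path crosses the hub set."""
    best = math.inf
    for h, cost_a in to_hub[a].items():
        for k, cost_b in to_hub[b].items():
            # Same hub on both sides: a -> h -> b.
            # Different hubs: a -> h -> k -> b.
            middle = 0.0 if h == k else hub_to_hub[h][k]
            best = min(best, cost_a + middle + cost_b)
    return best
```

With two hubs p and q this enumerates exactly the four candidates listed above; the stored tables hold (|A| + |B|) entries per hub instead of |A| x |B| pairwise distances.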
92ObjectRank Balmin2004
- Given a data graph with nodes having text
- For each keyword precompute a keyword-sensitive Pagerank RichardsonD2001
- Score of a node for multiple keyword search based on fuzzy AND/OR
- Approximation to Pagerank of node with restarts to nodes matching keywords
- Use Fagin-merge Fagin2002 to get best nodes in data graph
93Connected subgraph as response
- Single node may not match all keywords
- No natural page boundary
- On-the-fly joins make up a response page
- Two scenarios
- Keyword search on relational data
- Keywords spread among normalized relations
- Keyword search on XML-like or Web data
- Keywords spread among DOM nodes and subtrees
94Keyword search on relational data
- Tuple = node
- Some columns have text
- Foreign key constraints = edges in schema graph
- Query = set of terms
- No natural notion of a document
- Normalization
- Join may be needed to generate results
- Cycles may exist in schema graph: Cites
[Schema: Cites(Citing, Cited, ...); Paper(PaperID, PaperName, ...); Writes(AuthorID, PaperID, ...); Author(AuthorID, AuthorName, ...)]
95DBXplorer and DISCOVER
- Enumerate subsets of relations in schema graph which, when joined, may contain rows which have all keywords in the query
- Join trees derived from schema graph
- Output SQL query for each join tree
- Generate joins, checking rows for matches
- Agrawal2001, Hristidis2002
[Figure: schema graph over tables T1 to T5 and candidate join trees covering keywords K1, K2, K3]
96Discussion
- Exploits relational schema information to contain search
- Pushes final extraction of joined tuples into RDBMS
- Faster than dealing with full data graph directly
- Coarse-grained ranking based on schema tree
- Does not model proximity or (dis)similarity of individual tuples
- No recipe for data with less regular (e.g. XML) or ill-defined schema
97Motivation from Web search
- Linux modem driver for a Thinkpad A22p
- Hyperlink path matches query collectively
- Conjunction query would fail
- Projects where X and P work together
- Conjunction may retrieve wrong page
- General notion of graph proximity
- Thinkpad
- Drivers
- Windows XP
- Linux
- Download
- Installation tips
- Modem
- Ethernet
- The B System
- Group members
- P
- S
- X
- Home Page of Professor X
- Papers
- VLDB
- Students
- P
- Q
P's home page: "I work on the B project."
98Data structures for search
- Answer tree with at least one leaf containing each keyword in query
- Group Steiner tree problem, NP-hard
- Query term t found in source nodes $S_t$
- Single-source-shortest-path (SSSP) iterator
- Initialize with a source (near-) node
- Consider edges backwards
- getNext() returns next nearest node
- For each iterator, each visited node v maintains for each t a set $v.R_t$ of nodes in $S_t$ which have reached v
99Generic expanding search
- Near node sets $S_t$ with $S = \cup_t S_t$
- For all source nodes $\sigma \in S$
- create an SSSP iterator with source $\sigma$
- While more results required
- Get next iterator and its next-nearest node v
- Let t be the term for the iterator's source s
- crossProduct = $\{s\} \times \prod_{t' \ne t} v.R_{t'}$
- For each tuple of nodes in crossProduct
- Create an answer tree rooted at v with paths to each source node in the tuple
- Add s to $v.R_t$
100Search example (Vu Kleinberg)
[Figure: query nodes Quoc Vu and Jon Kleinberg; schema with author, writes, cites, paper]
101First response
[Figure: answer tree connecting authors Quoc Vu and Jon Kleinberg through writes and cites edges over the papers "Organizing Web pages by Information Unit", "Authoritative sources in a hyperlinked environment", and "A metric labeling problem", with co-authors Divyakant Agrawal and Eva Tardos; schema with author, writes, cites, paper]
102Subgraph search screenshot
http://www.cse.iitb.ac.in/banks/
103Similarity, neighborhood, influence
104Why are two nodes similar?
- What is/are the best paths connecting two nodes, explaining why/how they are related?
- Graph of co-starring, citation, telephone call, ...
- Graph with nodes s and t; budget of b nodes
- Find best b nodes capturing relationship between s and t FaloutsosMT2004
- Proposing a definition of goodness
- How to efficiently select best connections
[Figure: connection subgraph with nodes Negroponte, Palmisano, Esther Dyson, Gerstner]
105Simple proposals that do not work
- Shortest path
- Pizza boy p gets same attention as g
- Network flow
- s -> a -> b -> t is as good as s -> g -> t
- Voltage
- Connect 1V at s, ground t
- Both g and p will be at 0.5V
- Observations
- Must reward parallel paths
- Must reward short paths
- Must penalize/tax pizza boys
[Figure: graph with nodes s, a, b, g, p, t]
106Resistive network with universal sink
- Connect 1V to s
- Ground t
- Introduce universal sink
- Grounded
- Connected to every node
- Universal sink is a tax collector
- Penalizes pizza boys
- Penalizes long paths
- Goodness of a path is the electric current it
carries
[Figure: graph with nodes s, a, b, g, p, t and a universal sink connected to every node]
107Resistive network algorithm
- Ohm's law
- Kirchhoff's current law
- Boundary conditions (without sink)
- Solution
- Here C(u,v) is the conductance from u to v
- Add grounded universal sink z with V(z) = 0
- Set
- Display subgraph carrying high current
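The formulas on this slide were lost in extraction. The sketch below uses the textbook forms: Ohm's law $I(u,v) = C(u,v)(V(u) - V(v))$, Kirchhoff's law (net current is zero at every node other than s and t), boundary conditions V(s) = 1 and V(t) = 0, and the resulting fixed point where V(u) is the conductance-weighted average of its neighbors' voltages. The sink conductance $C(u,z) = \alpha \sum_w C(u,w)$ is an assumed form for the missing "Set" formula, and the names `voltages`, `current`, and `alpha` are hypothetical.

```python
def voltages(conductance, s, t, alpha=0.0, max_iter=10000, tol=1e-12):
    """Solve for node voltages by repeated local averaging (Gauss-Seidel).

    conductance: symmetric dict of dicts, conductance[u][v] = C(u, v);
                 every node must appear as a key.
    alpha:       assumed strength of the grounded universal sink;
                 alpha = 0 means no sink.
    """
    V = {u: 0.0 for u in conductance}
    V[s] = 1.0  # boundary conditions: V(s) = 1, V(t) = 0
    for _ in range(max_iter):
        change = 0.0
        for u, nbrs in conductance.items():
            if u == s or u == t:
                continue
            total = sum(nbrs.values())
            if total == 0.0:
                continue
            # The sink adds conductance alpha * total to ground (V = 0).
            new = sum(c * V[v] for v, c in nbrs.items()) / (total * (1.0 + alpha))
            change = max(change, abs(new - V[u]))
            V[u] = new
        if change < tol:
            break
    return V


def current(conductance, V, u, v):
    """Ohm's law: current flowing from u to v."""
    return conductance[u][v] * (V[u] - V[v])
```

On the path s, g, t with unit conductances this gives V(g) = 0.5 without a sink, matching the voltage example earlier; with alpha = 1 it drops to 0.25. Because the sink drains in proportion to a node's total conductance, high-degree nodes and long paths lose the most current.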
108Distributions coupled via graphs
- Hierarchical classification
- Document topics organized in a tree
- Mapping between ontologies
- Can Dmoz label help labeling in Yahoo?
- Hypertext classification
- Topic of Web page better predicted from hyperlink neighborhood
- Categorical sequences
- Part-of-speech tagging, named entity tagging
- Disambiguation and linkage analysis
109Hierarchical classification
- Obvious approaches
- Flatten to leaf topics, losing hierarchy info
- Level-by-level, compounding error probability
- Cascaded generative model
- Pr(c|d,r) estimated as Pr(c|r) Pr(d|c) / Z(r)
- Estimate of Pr(d|c) makes naïve independence assumptions if d has high dimensionality
- Pr(c|d,r) tends to 0/1 for large dimensions, and mistakes made at shallow levels become irrevocable
[Figure: tree node r with child c]
110Global discriminative model
- Each node has an associated bit X
- Propose a parametric form
- Each training instance sets one path to 1, all other nodes have X = 0
[Figure: topic tree T with document d, node bits $x_r$ (0 or 1), features F(d, $x_r$), and parameters $w_c$]
111Hypertext classification
- c = class, t = text, N = neighbors
- Text-only model: Pr[t|c]
- Using neighbors' text to judge my topic: Pr[t, t(N) | c]
- Better model: Pr[t, c(N) | c]
- Non-linear relaxation
112Generative graphical model results
- 9600 patents from 12 classes marked by USPTO
- Patents have text and cite other patents
- Expand test patent to include neighborhood
- Forget fraction of neighbors' classes
113Discriminative graphical model
- OA(X) direct own attributes of node X
- LD(X) link-derived attributes of node X
- Mode-link most frequent label of neighbors(X)
- Count-link histogram of neighbor labels
- Binary-link 0/1 histogram of neighbor labels
- Iterate as in generative case
Local model params
Neighborhood model params
114Discriminative model results LuG2003
- Binary-link and count-link outperform content-only at 95% confidence
- Better to separately estimate $w_l$ and $w_o$
- In + Out + Cocitation better than any subset for LD
115Undirected Markov networks
- Clique $c \in C(G)$: a set of completely connected nodes
- Clique potential $\phi_c(V_c)$: a function over all possible configurations of nodes in $V_c$
- Decompose Pr(v) as $(1/Z) \prod_{c \in C(G)} \phi_c(V_c)$
- Parametric form
[Figure labels: label coupling; instance; local feature variable; label variable; params of model; feature functions]
116Conditional and relational networks
- x = vector of observed features at all nodes
- y = vector of labels
- A set of clique templates specifying links to use
- Other features in the same HTML section
117Special case sequential networks
- Text modeled as sequence of tokens drawn from a large but finite vocabulary
- Each token has attributes
- Visible: allCaps, noCaps, hasXx, allDigits, hasDigit, isAbbrev, (part-of-speech, wnSense)
- Not visible: part-of-speech, (isPersonName, isOrgName, isLocation, isDateTime), starts/continues/ends-noun-phrase
- Visible (symbols) and invisible (states) attributes of nearby tokens are dependent
- Application decides what is (not) visible
- Goal: estimate invisible attributes
118Hidden Markov model
- A generative sequential model for the joint
distribution of states (s) and symbols (o)
[Figure: chain of states $S_{t-1}$, $S_t$, $S_{t+1}$, each emitting a symbol $O_{t-1}$, $O_t$, $O_{t+1}$]
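For reference, the joint distribution that a first-order HMM of this shape defines factorizes into transition and emission terms. The notation below (a sequence of length T and a fixed start state $s_0$) is an assumed convention, not taken from the slide.

```latex
\[
\Pr(s, o) \;=\; \prod_{t=1}^{T} \Pr(s_t \mid s_{t-1}) \, \Pr(o_t \mid s_t)
\]
```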
119Using redundant token features
- Each o is usually a vector of features extracted from a token
- Might have high dependence/redundancy: hasCap, hasDigit, isNoun, isPreposition
- Parametric model for $\Pr(s_t \to o_t)$ needs to make naïve assumptions to be practical
- Overall joint model Pr(s,o) can be very inaccurate
- (Same argument as in naïve Bayes vs. SVM or maximum entropy text classifiers)
120Discriminative graphical model
- Assume one-stage Markov dependence
- Propose direct parametric form for conditional
probability of state sequence given symbol
sequence
[Figure labels: model; log-linear form; feature function might depend on whole o; parameters to fit]
121Feature functions and parameters
- Penalize large params
- Maximize total conditional likelihood over all instances
- Find $\partial L / \partial \lambda_k$ for each k and perform a gradient-based numerical optimization
- Efficient for linear state dependence structure
122Conditional vs. joint results
- Out-of-vocabulary error
- Orthography: use words, plus overlapping features: isCap, startsWithDigit, hasHyphen, endsWith -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies
123Summary
- Graphs provide a powerful way to model many kinds of data, at multiple levels
- Web pages, XML, relational data, images
- Words, senses, phrases, parse trees
- A few broad paradigms for analysis
- Factors affecting graph evolution over time
- Eigen analysis, conductance, random walks
- Coupled distributions between node attributes and graph neighborhood
- Several new classes of model estimation and inferencing algorithms
124References
- BrinP1998: The Anatomy of a Large-Scale Hypertextual Web Search Engine. WWW.
- GoldmanSVG1998: Proximity search in databases. VLDB, 26-37.
- ChakrabartiDI1998: Enhanced hypertext categorization using hyperlinks. SIGMOD.
- BikelSW1999: An Algorithm that Learns What's in a Name. Machine Learning Journal.
- GibsonKR1999: Clustering categorical data: An approach based on dynamical systems. VLDB.
- Kleinberg1999: Authoritative sources in a hyperlinked environment. JACM 46.
125References
- CohnC2000: Probabilistically Identifying Authoritative Documents. ICML.
- LempelM2000: The stochastic approach for link-structure analysis (SALSA) and the TKC effect. Computer Networks 33 (1-6), 387-401.
- RichardsonD2001: The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank. NIPS 14 (1441-1448).
- LaffertyMP2001: Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. ICML.
- BorkarDS2001: Automatic text segmentation for extracting structured records. SIGMOD.
126References
- NgZJ2001: Stable algorithms for link analysis. SIGIR.
- Hulgeri2001: Keyword Search in Databases. IEEE Data Engineering Bulletin 24(3), 22-32.
- Hristidis2002: DISCOVER: Keyword Search in Relational Databases. VLDB.
- Agrawal2002: DBXplorer: A system for keyword-based search over relational databases. ICDE.
- TaskarAK2002: Discriminative probabilistic models for relational data.
- Fagin2002: Combining fuzzy information: an overview. SIGMOD Record 31(2), 109-118.
127References
- Chakrabarti2002: Mining the Web: Discovering Knowledge from Hypertext Data.
- Tomlin2003: A New Paradigm for Ranking Pages on the World Wide Web. WWW.
- Haveliwala2003: Topic-Sensitive Pagerank: A Context-Sensitive Ranking Algorithm for Web Search. IEEE TKDE.
- LuG2003: Link-based Classification. ICML.
- FaloutsosMT2004: Connection Subgraphs in Social Networks. SIAM-DM workshop.
- PanYFD2004: GCap: Graph-based Automatic Image Captioning. MDDE/CVPR.
128References
- Balmin2004: Authority-Based Keyword Queries in Databases using ObjectRank. VLDB.
- BarYossefBKT2004: Sic transit gloria telae: Towards an understanding of the Web's decay. WWW2004.