Title: Building Optimal Websites with the Constrained Subtree Selection Problem
1 Building Optimal Websites with the Constrained Subtree Selection Problem
- Brent Heeringa
- (joint work with Micah Adler)
- 09 November 2004
2 A website design problem (for example, a new kitchen store)
- Given products, their popularity, and their organization
- How do we create a good website?
- Navigation is natural
- Access to information is timely
3 Good website: Natural Navigation
- Organization is a DAG
- Transitive closure (TC) of the DAG enumerates all viable categorical relationships and introduces shortcuts
- Subgraph of TC preserves logical relationship between categories
4 Good website: Timely Access to Info
- Two obstacles to finding info quickly
- Time scanning a page for correct link
- Time descending the DAG
- Associate a cost with each obstacle
- Page cost (function of out-degree of node)
- Path cost (sum of page costs on path)
- Good access structure
- Minimize expected path cost
- Optimal subgraph is always a full tree
- Example: Page Cost = # links; Path Cost = 3 + 2 = 5; leaf weight 1/2; Weighted Path Cost = 5/2
5 Constrained Subtree Selection (CSS)
- An instance of CSS is a triple (G, γ, w)
- G is a rooted DAG with n leaves (constraint graph)
- γ is a function of the out-degree of each internal node (degree cost)
- w is a probability distribution over the n leaves (weights)
- A solution is any directed subtree of the transitive closure of G which includes the root and leaves
- An optimal solution is one which minimizes the expected path cost
- Example: four leaves with weight 1/4 each; γ(x) = x
6 Constrained Subtree Selection (CSS)
- Example: four leaves with weight 1/4 each and path costs 3, 5, 5, 3; γ(x) = x
- Cost = 3(1/4) + 5(1/4) + 5(1/4) + 3(1/4) = 4
7 Constrained Subtree Selection (CSS)
- Example: leaf weights 1/2, 1/6, 1/6, 1/6; γ(x) = x
- Cost = 3 1/2
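The two example costs above can be checked with a short computation. This is a minimal sketch, not the authors' code: the nested-list tree encoding, the function names, and the two tree shapes are assumptions chosen so that the leaf path costs match the slides (3, 5, 5, 3 for the uniform case), with the linear degree cost where a page with x links costs x.

```python
from fractions import Fraction


def linear_cost(out_degree):
    """Linear degree cost from the slide examples: a page with x links costs x."""
    return out_degree


def expected_path_cost(tree, weights, degree_cost=linear_cost):
    """Sum over leaves of (leaf weight) * (sum of page costs on its path).

    Hypothetical encoding: an internal node is a list of children,
    a leaf is a string that is a key of `weights`.
    """
    def walk(node, cost_so_far):
        if isinstance(node, str):               # leaf: charge the accumulated page costs
            return weights[node] * cost_so_far
        page_cost = degree_cost(len(node))      # cost of scanning this page
        return sum(walk(child, cost_so_far + page_cost) for child in node)

    return walk(tree, 0)


# Assumed shape for the uniform example: root with three links, one of them a binary page.
uniform = {name: Fraction(1, 4) for name in "abcd"}
tree_uniform = ["a", ["b", "c"], "d"]

# Assumed shape for the skewed example: heavy leaf next to a ternary page.
skewed = {"a": Fraction(1, 2), "b": Fraction(1, 6), "c": Fraction(1, 6), "d": Fraction(1, 6)}
tree_skewed = ["a", ["b", "c", "d"]]

print(expected_path_cost(tree_uniform, uniform))  # 4
print(expected_path_cost(tree_skewed, skewed))    # 7/2
```

Exact fractions are used so the results compare cleanly with the values on the slides.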
8 Constraint-Free Graphs and k-favorability
- Constraint-Free Graph
- Every directed, full tree with n leaves is a subtree of the TC
- CSS is no longer constrained by the graph
- k-favorable degree cost γ
- Fix γ. There exists k > 1 such that, for any constraint-free instance of CSS under γ, an optimal tree has maximal out-degree k
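For tiny constraint-free instances the optimum can be found by exhaustive search, which also shows the largest out-degree an optimal tree uses. The sketch below is an illustration only, not the algorithm from the talk. It relies on the identity that expected path cost equals the sum, over internal nodes, of degree cost times the leaf mass below the node. The names `set_partitions` and `optimal_cost` are hypothetical, and the running time is exponential in n.

```python
from fractions import Fraction
from functools import lru_cache


def set_partitions(items):
    """Yield every partition of the sorted tuple `items` into non-empty blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        yield [(first,)] + partition                      # `first` alone in a new block
        for i, block in enumerate(partition):             # or `first` joins an existing block
            yield partition[:i] + [(first,) + block] + partition[i + 1:]


def optimal_cost(weights, degree_cost=lambda d: d):
    """Return (minimum expected path cost, max out-degree of one cheapest tree)."""
    @lru_cache(maxsize=None)
    def best(leaves):
        if len(leaves) == 1:
            return (Fraction(0), 0)                       # a single leaf costs nothing
        mass = sum(weights[i] for i in leaves)
        candidates = []
        for partition in set_partitions(leaves):
            if len(partition) < 2:                        # full tree: out-degree >= 2
                continue
            parts = [best(block) for block in partition]
            cost = degree_cost(len(partition)) * mass + sum(p[0] for p in parts)
            degree = max([len(partition)] + [p[1] for p in parts])
            candidates.append((cost, degree))
        return min(candidates)                            # ties go to the smaller degree

    return best(tuple(range(len(weights))))


skewed = (Fraction(1, 2), Fraction(1, 6), Fraction(1, 6), Fraction(1, 6))
print(optimal_cost(skewed))  # (Fraction(7, 2), 3)
```

With linear degree cost the skewed weights reach cost 7/2 with a ternary page, matching the 3 1/2 in the earlier example.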
9 Linear Degree Cost - γ(x) = x
- 3 paths w/ cost 5
- 2 paths w/ cost 4
10 Linear Degree Cost - γ(x) = x
- Prefer binary structure when a leaf has at least half the mass (> 1/2 in the figure)
- Prefer ternary structure when mass is uniformly distributed
- CSS with 2-favorable degree costs and constraint-free (C.F.) graphs is the Huffman coding problem
- Examples: quadratic, exponential, ceiling of log
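The connection to Huffman coding can be made concrete. With a 2-favorable degree cost on a constraint-free graph, an optimal tree is a full binary tree, so every internal node costs γ(2) and the CSS cost is γ(2) times the expected leaf depth. The sketch below is the standard greedy Huffman merge; the function name is hypothetical.

```python
import heapq
from fractions import Fraction
from itertools import count


def huffman_expected_depth(weights):
    """Expected leaf depth of a Huffman tree built by the usual greedy merging."""
    tie = count()                                   # tie-breaker for equal weights
    heap = [(w, next(tie)) for w in weights]
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        a, _ = heapq.heappop(heap)                  # two lightest subtrees
        b, _ = heapq.heappop(heap)
        total += a + b                              # all leaves below move one level down
        heapq.heappush(heap, (a + b, next(tie)))
    return total


weights = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)]
print(huffman_expected_depth(weights))  # 7/4
```

For a quadratic cost γ(x) = x^2, one of the examples named on the slide, the corresponding CSS cost would be 4 times this expected depth.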
11 Results
- Complexity: NP-Complete for equal weights and many γ
- Sufficient condition on γ
- Hardness depends on constraint graph
- Highlighted Results
- Theorem: O(n^(γ(k)+k))-time DP algorithm
- γ is integer-valued, k-favorable and G is constraint free
- γ(x) = x
- Theorem: poly-time constant-approximation
- γ ≥ 1 and k-favorable; G has constant out-degree
- Approximate Hotlink Assignment - Kranakis et al.
- Other results
- Characterizations of optimal trees for uniform probability distributions
12 Related Work
- Adaptive Websites: Perkowitz and Etzioni
- Challenge to the AI community
- Novel views of websites: Page synthesis problem
- Hotlink Assignment: Kranakis, Krizanc, Shende, et al.
- Add 1 hotlink per page to minimize expected distance from root to leaves
- Recently: pages have cost proportional to their size
- Hotlinks don't change page cost
- Optimal Prefix-Free Codes: Golin and Rote
- Min code for n words with r symbols where symbol a_i has cost c_i
- Resembles CSS without a constraint graph
13 Exact Cover by 3-Sets
INPUT: (X, C) where X = (x1, ..., xn), n = 3k, and C = (C1, ..., Cm) with each Ci ⊆ X
OUTPUT: C' ⊆ C where |C'| = k and C' covers X
QUESTION: Given K and (X, C), is there a cover of size K?
Sufficient condition on γ: For every integer k, there exists an integer s(k) such that
14 (X, C): X = (x1, ..., xn), n = 3k, and C = (C1, ..., Cm), Ci ⊆ X
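To make the source problem of the reduction concrete, here is a brute-force checker for Exact Cover by 3-Sets. It is only a sketch with a hypothetical function name. It tries every choice of k = n/3 triples, so it is exponential and meant for tiny instances.

```python
from itertools import combinations


def exact_cover_by_3_sets(universe, triples):
    """Return a list of k = |X|/3 triples that covers `universe`, or None."""
    universe = frozenset(universe)
    if len(universe) % 3 != 0:
        return None
    k = len(universe) // 3
    for choice in combinations(triples, k):
        # k triples that cover all 3k elements are necessarily pairwise disjoint
        if frozenset().union(*choice) == universe:
            return list(choice)
    return None


X = range(1, 7)
C = [(1, 2, 3), (2, 3, 4), (4, 5, 6), (1, 5, 6)]
print(exact_cover_by_3_sets(X, C))  # [(1, 2, 3), (4, 5, 6)]
```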
15 Lopsided Trees
- Recall γ(x) = x, and G is constraint free
- Node level = path cost
- Adding an edge increases level
- Grow lopsided trees level by level
19 Lopsided Trees
- We know exact cost of tree up to the current level i
- Exact cost of m leaves
- Remaining n-m leaves must have path-cost at least i
20 Lopsided Trees
- Exact cost of C: 3 · (1/3) = 1
- Remaining mass up to level 4: (2/3) · 4 = 8/3
- Total: 1 + 8/3 = 11/3
21 Lopsided Trees
- Tree cost at Level 5 in terms of Tree cost at Level 4
- Add in the mass of remaining leaves
- Cost at Level 5
- No new leaves
- 11/3 + 2/3 = 13/3
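The arithmetic of the truncation cost can be replayed in a few lines. This is a sketch with hypothetical function names. It assumes, as on the slides, that leaf C has weight 1/3 at level 3 and that the remaining mass of 2/3 is charged up to the current level.

```python
from fractions import Fraction


def truncation_cost(placed_leaves, remaining_mass, level):
    """Exact cost of the placed leaves plus `level` for every unplaced unit of mass.

    `placed_leaves` is a list of (weight, path_cost) pairs.
    """
    exact = sum(weight * cost for weight, cost in placed_leaves)
    return exact + remaining_mass * level


def grow_one_level(cost, remaining_mass):
    """Moving the frontier down one level charges the remaining mass once more."""
    return cost + remaining_mass


level4 = truncation_cost([(Fraction(1, 3), 3)], Fraction(2, 3), 4)
level5 = grow_one_level(level4, Fraction(2, 3))
print(level4, level5)  # 11/3 13/3
```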
24 Lopsided Trees
- Equality on trees
- Equal number of leaves at or above frontier
- Equal number of leaves at each relative level below frontier
- Nodes have outdegree ≤ 3
- Node below frontier ≤ γ(3) levels
- (m; l1, l2, l3) signature
- Example Signature: (2; 3, 2, 0)
- 2: C and F are leaves
- 3: G, H, I are 1 level past the frontier
- 2: J and K are 2 levels past the frontier
25 Inductive Definition
- Let CSS(m,l1,l2,l3) = min cost tree with sig. (m; l1, l2, l3)
- Can we define CSS(m,l1,l2,l3) in terms of optimal substructures?
- Which trees, when grown by one level, have signature (m; l1, l2, l3)?
- Which signatures (m'; l1', l2', l3') lead to (m; l1, l2, l3)?
26 The other direction
- Example: Sig (0; 2, 0, 0) grows to Sig (1; 0, 0, 3)
- Growing a tree only affects frontier
- Only l1 affects next level
- Choose leaves
- The remaining nodes are internal
- Choose degree-2 (d2)
- Remaining nodes are degree-3 (d3)
- O(n^2) choices
27 The original question (warning: here be symbols)
- Which (m'; l1', l2', l3') → (m; l1, l2, l3)?
- l1' and d2 are sufficient
- l1' and d2 are both O(n)
- O(n^2) possibilities for (m'; l1', l2', l3')
- CSS(m,l1,l2,l3) = min cost tree with sig. (m; l1, l2, l3)
- CSS(m,l1,l2,l3) = min { CSS(m',l1',l2',l3') + cm } for 1 ≤ d2 ≤ l1' ≤ n
- (cm is the total of the smallest n-m weights)
- CSS(n,0,0,0) = cost of optimal tree
- Analysis
- Table size O(n^4)
- Each cell takes O(n^2) lookups
- O(n^6) algorithm
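28 Lower Bound on Cost
- Lemma: H(w)/log(k) is a lower bound on the cost of an optimal tree
- For any k-favorable degree cost γ, with γ ≥ 1
- G is constraint-free
- c(T) ≥ c'(T) ≥ c'(T') ≥ H(w)/log(k), where the last step is Shannon's bound
- T is the optimal tree under degree cost c; T' is the optimal tree when every node costs 1 (cost c')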
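The bound in the lemma is easy to evaluate numerically. The sketch below uses a hypothetical function name and computes H(w)/log(k) with base-2 logarithms in both places, which equals the entropy measured in base k.

```python
import math


def entropy_lower_bound(weights, k):
    """H(w) / log(k): lower bound on optimal cost when the degree cost is at least 1."""
    entropy = -sum(w * math.log2(w) for w in weights if w > 0)
    return entropy / math.log2(k)


print(entropy_lower_bound([0.25, 0.25, 0.25, 0.25], 2))  # 2.0
```

For the skewed weights 1/2, 1/6, 1/6, 1/6 with k = 3 the bound is roughly 1.13, well below the cost 3 1/2 of the linear-cost example, as expected of a lower bound.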
29 A Simple Lemma
- Lemma 2: For any tree with m weighted nodes there exists 1 node (splitter) which, when removed, divides the tree into subtrees with at most half the weight of the original tree.
[Figure: a splitter node whose removal leaves subtrees of weight < 1/2 each]
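One way to find such a splitter is to walk from the root toward the heaviest child for as long as that child carries more than half of the total weight. This is a sketch under assumptions: the dictionary-based tree encoding and the name `find_splitter` are hypothetical, and the slides do not say how the splitter is located.

```python
def find_splitter(children, weight, root):
    """Return a node whose removal leaves components of weight <= half the total.

    `children` maps node -> list of children; `weight` maps node -> its own weight.
    """
    subtree = {}

    def total(node):
        subtree[node] = weight[node] + sum(total(c) for c in children.get(node, []))
        return subtree[node]

    whole = total(root)
    node = root
    while True:
        # at most one child can hold more than half of the total weight
        heavy = [c for c in children.get(node, []) if 2 * subtree[c] > whole]
        if not heavy:
            return node   # every child subtree, and the part above, is <= half
        node = heavy[0]


children = {"r": ["a", "b"], "a": ["c", "d"]}
weight = {"r": 1, "a": 1, "b": 1, "c": 4, "d": 1}
print(find_splitter(children, weight, "r"))  # a
```

Removing `a` leaves components of weight 4, 1 and 2 out of a total of 8, so none exceeds half.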
30 Approximation Algorithm
- Let G be a DAG where out-degree of every node ≤ d
- Choose a spanning tree T from G
- Balance-Tree(T)
- Find a splitter node in T (Lemma 2)
- Stop if splitter is child of root
- Disconnect the splitter and reconnect it to the root
- root has degree at most d+1
- Call Balance-Tree on all subtrees
- Mass of each subtree is at most half of whole tree
31 Approximation Algorithm
- Analysis
- Mass under any node is at most half of mass under its grandparent
- Path length to leaf with weight wi is ≤ -2 log(wi)
- Theorem
- O(m)-time O(log(k) γ(d+1))-approx to optimal solution
- For any DAG G with m nodes and out-degree ≤ d
- For every k-favorable degree cost γ ≥ 1
- Cost ≤ γ(d+1) (upper bound on node cost) × Σ -2 wi log(wi) (weighted path length)
35 Proposed Problem 1 (CSS in constraint-free graphs, equal leaf weights)
- Question: Polytime algorithm for CSS with
- Constraint-free graphs
- Equal leaf weights
- Increasing degree cost
- Good News
- Characterizations for linear and log degree costs
- Near linear time algorithms for r-ary Varn Codes (Huffman codes with r unequal letter costs, uniform probability distribution)
36 Varn Codes (infinite lopsided tree)
- Symbol Costs: (3,3,3,8,8)
- 5 Leaves
- Note: Not the 5 highest leaves!
37 Varn Codes (infinite lopsided tree)
- Symbol Costs: (3,3,3,8,8)
- 6 Leaves
- Note: m internal nodes are the highest m nodes in the infinite tree
38 Proposed Problem 1 (CSS in constraint-free graphs, equal leaf weights)
- Bad News
- No notion of an infinite lopsided tree in CSS
- Degree change implies structure change
- Optimal CSS tree is fairly balanced
- Property
- No leaf may appear above the level of any other internal node
- Proof: If it were the case, we could switch branches and decrease the cost of the tree
- Intuition: There is some k which optimizes breadth-to-depth tradeoff. The optimal tree repeats this structure. Fringe requires some computation time.
39 Proposed Problem 2 (Dynamic CSS)
- CSS often applies to environments which are inherently dynamic
- Web pages change popularity
- Access patterns change on file systems
- Question: Given a CSS tree with property P, how much time does it take to maintain P after an update?
- P: minimum cost, approximation-ratio of min cost
- Restrict attention to
- Integer leaf weights (rational distributions)
- Unit updates
40 Proposed Problem 2 (Dynamic CSS)
- Good News: Knuth (and later Vitter) studied Dynamic Huffman Codes (DHC)
- Motivation: One-pass encoding
- Protocol
- Both parties maintain optimal tree for first t characters
- Encode and decode character t+1
- Update tree
- Optimality of tree maintained in time proportional to encoding
41 DHC Sibling Property
- A binary tree with n leaves is a Huffman tree iff
- The n leaves have nonnegative weights w1, ..., wn
- The weight of each internal node is the sum of the weights of its children
- The nodes can be numbered in non-decreasing order by weight
- Siblings are numbered consecutively
- Common parent has a higher number
- Numbering corresponds to merging in greedy algorithm
[Figure: Huffman tree with leaves A-F and root weight 32; each node is labeled with its weight and its number, 1-11]
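The conditions above can be checked mechanically once a numbering is given. The sketch below is an illustration with hypothetical names. The example tree is one reading of the slide's figure (leaf weights A=2, B=3, C=5, D=5, E=6, F=11, root weight 32), and that reading is an assumption. The checker verifies a supplied numbering and does not search for one.

```python
def has_sibling_property(weights, parent, order):
    """Check an explicit numbering (`order`, root last) against the listed conditions."""
    number = {node: i for i, node in enumerate(order, start=1)}
    # Each internal node must weigh as much as its children combined.
    child_sum = {}
    for node, p in parent.items():
        if p is not None:
            child_sum[p] = child_sum.get(p, 0) + weights[node]
    if any(weights[p] != total for p, total in child_sum.items()):
        return False
    # Weights must be non-decreasing along the numbering.
    if any(weights[a] > weights[b] for a, b in zip(order, order[1:])):
        return False
    # Nodes 2j-1 and 2j must be siblings whose parent has a higher number.
    for left, right in zip(order[0:-1:2], order[1:-1:2]):
        if parent[left] != parent[right] or number[parent[left]] <= number[right]:
            return False
    return True


parent = {"A": "AB", "B": "AB", "C": "CAB", "AB": "CAB", "D": "DE", "E": "DE",
          "CAB": "X", "DE": "X", "F": "root", "X": "root", "root": None}
weights = {"A": 2, "B": 3, "C": 5, "AB": 5, "D": 5, "E": 6,
           "CAB": 10, "DE": 11, "F": 11, "X": 21, "root": 32}
order = ["A", "B", "C", "AB", "D", "E", "CAB", "DE", "F", "X", "root"]
print(has_sibling_property(weights, parent, order))  # True

# Increase B by one and propagate to its ancestors without exchanging any nodes.
bumped = dict(weights)
for node in ["B", "AB", "CAB", "X", "root"]:
    bumped[node] += 1
print(has_sibling_property(bumped, parent, order))   # False
```

In the second call node 4 (the parent of A and B) has weight 6 while node 5 still has weight 5, so the numbering is no longer non-decreasing.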
42 DHC Sibling Property
- What happens if we increase B? Node 4 violates the Sibling Property
[Figure: the same tree after B is increased; root weight 33]
43 DHC Sibling Property
- Before updating: exchange the current node with the highest-numbered node having the same weight
45 DHC Sibling Property
- Different, but still optimal, greedy choice when merging nodes
48 DHC Sibling Property
- Now it is safe to increase B, because it can't be greater than the next highest!
50 Proposed Problem 2 (Dynamic CSS)
- Good News: DHC generalizes to k-ary alphabets
- Claim
- DHC is an O(γ(k))-approximation for CSS
- γ k-favorable, γ(x) ≥ 1
- constraint-free graphs
51 Proposed Problem 2 (Dynamic CSS)
- Bad News: DHC doesn't generalize to Huffman codes with unequal letter costs
- Sibling property corresponds to the greedy algorithm
- Future
- Explore DHC for unequal letter costs
- Maintain approximation ratio in constant degree graphs in time proportional to the height
- (We can do it in linear time already)
52 Proposed Problem 3 (Category Tree - CT)
- Scenario
- Large reservoir of songs in iTunes
- Song is a vector of categorical values
- Common to search all the songs for the right one
- Question: Can we organize the songs by categories so that the average search time is minimized?
53 Proposed Problem 3 (Category Tree - CT)
- Category Tree: CT(γ, C, S)
- γ is the degree cost
- C = (d1, ..., dm) are the m category sizes
- S is a set of objects drawn from C
- Solution: Rooted, oriented tree
- Internal nodes are categories
- Edges are appropriate categorical values
- Leaves are objects
- Optimal solution
- Minimize expected path cost
- Path cost is defined as in CSS
- Optimal solution corresponds to an adaptive ordering of the categories
54 Proposed Problem 3 (Constrained Category Tree - CCT)
- Constrained Category Tree: CCT(γ, C, S)
- γ is the degree cost
- C = (d1, ..., dm) are the m category sizes
- S is a set of objects drawn from C
- Solution: Rooted, oriented tree
- Internal nodes are categories (and internal nodes at the same depth have the same category)
- Edges are appropriate categorical values
- Leaves are objects
- Optimal solution
- Minimize expected path cost
- Path cost is defined as in CSS
- Optimal solution corresponds to a fixed ordering of the categories
55 Proposed Problem 3 (Category Tree - CT)
- CT and CCT are classical Decision Tree problems
Decision Tree (DT)
Input: m binary tests T = (T1, ..., Tm) and n objects O = (O1, ..., On)
Output: Binary tree where internal nodes are Ti and leaves are Oi
Measure: Total external path length
- CT and CCT are NP-Complete
- Reduction from Exact Cover by 3-Sets (XC3)
- Resembles hardness proof for Decision Tree
56 Proposed Problem 3 (Category Tree - CT)
Decision Tree Inference (DTI)
Input: m examples, T/F-labeled binary strings from {0,1}^n
Output: Binary tree where internal nodes are string positions and leaves are TRUE or FALSE, which is consistent with the examples
Measure: Number of leaves (i.e. size of tree)
- CT and CCT are not instances of DTI
- DT doesn't easily reduce to DTI
- Most complexity results (lower bounds on approximations) are for DTI only!
57 Timetable
- Solve some subset of open problems
- 1-2 academic years
58 Open Problems
- Theorem: There is an [...] for any instance (G, γ, w) of CSS where G is constraint free, γ is k-favorable, maps the positive integers to the positive integers and is non-decreasing
NO
- Proof
- c(T) ≥ c'(T) ≥ c'(T') ≥ H(w)/log(k)
- T is optimal tree for CSS cost c
- T' is optimal tree for OPC cost c' for k symbols each with weight 1 (i.e. γ(x) = 1)
- H is entropy
59 Signatures as Representation
- Different lopsided trees share common substructure when truncated
- Level-i-Truncations: Include node iff its parent is at level at most i
- Level-i-Signatures: (m; l1, ..., l_γ(k))
- m is the # of leaves at or above level i
- lj is the # of nodes at level i+j
- Cost of Level-i-Truncation
- Exact cost for m leaves
- Cost up to the truncation for the remaining n-m leaves.
60 The Dynamic Programming Table
- Signatures correspond to table entries
- MIN[m; l1, ..., l_γ(k)] gives min-cost of all truncated trees with signature (m; l1, ..., l_γ(k))
- O(n^(γ(k)+1)) entries
- level-i truncation is parent of O(n^(k-1)) level-(i+1) truncations
- level-i sig is parent of O(n^(k-1)) level-(i+1) sigs
- Choose how many nodes at next level will be internal
- Among those, choose how many will be degree 2, degree 3, ..., degree k: O(n^(k-1)) choices
- Consistent ordering of entries
- O(n^(γ(k)+k)) algorithm: MIN[n; 0, ..., 0] contains minimum cost
61
- Set of products
- The desired information
- e.g., chef paring knives
- Popularity of products
- Weights
- Hierarchical organization of products into categories
- Single, global category (the root)
- Products are endpoints (leaves)
- General to specific trajectory
62
- Adaptive Websites: Perkowitz and Etzioni
- Page synthesis (novel view) with clustering and concept learning using access logs
- Efficiently find topic of interest (effort)
- Hotlink Assignment: Kranakis, Krizanc, Shende, et al.
- Add k hotlinks per page to minimize expected distance from root to leaves
- Recently: pages have fixed cost proportional to their size
- Hotlinks don't change path-cost
- Optimal Prefix-Free Codes: Golin and Rote
- Min code for n words with r symbols where symbol a_i has cost c_i
- Resembles CSS without a constraint graph
63 Lopsided Trees
- (m; l1, ..., l_γ(k)) corresponds to MIN[m; l1, ..., l_γ(k)]
- n leaves so at most O(n^(γ(k)+1)) entries
- Entry stores minimum cost of tree bearing that signature
- Total ordering on signatures, consistent with the growing process
- O(n^(k-1)) choices
- O(n^(γ(k)+k)) algorithm
64 Lopsided Trees
- Tree cost at Level 5 in terms of Tree cost at Level 4
- Cost at Level 5: 11/3 + 2/3 = 13/3
- Cost at Level 6: 13/3 + 1/2 = 29/6
66 The original question (warning: here be symbols)
- Which (m'; l1', l2', l3') → (m; l1, l2, l3)?
- Suppose we know
- l1' (the # of nodes one level below the frontier)
- d2 (the # of l1' which are degree-2 nodes in (m; l1, l2, l3))
- Let's determine the values of the remaining variables
[Figure: the l1' nodes one level below the frontier, d2 of which have degree 2]
67 The original question (warning: here be symbols)
- m = m' + l1' - d2 - d3
- m is the new number of leaves and m' is the old number of leaves
- l1' is the number of nodes one level below the old frontier
- d2 and d3 are the internal nodes of degree 2 and degree 3
68 The original question (warning: here be symbols)
- m = m' + l1' - d2 - l3/3
- The internal nodes of degree 3 are determined by the new signature: d3 = l3/3
69 The original question (warning: here be symbols)
- l2' = l1
- The old number of nodes 2 levels below the frontier equals the new number of nodes one level below the frontier
70 The original question (warning: here be symbols)
- l2 = l3' + 2 d2
- l2 is the new number of nodes 2 levels below the frontier and l3' is the old number of nodes 3 levels below it
- d2 nodes are binary so they contribute 2 d2 to the frontier
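The relations on the last few slides combine into one step that advances the frontier by a level. The sketch below uses explicit old and new names instead of primes and assumes linear degree cost with out-degree at most 3. The function name `grow` is hypothetical.

```python
def grow(signature, d2, d3):
    """Advance the frontier one level, assuming degree cost x and out-degree <= 3.

    `signature` is the old (m, l1, l2, l3). Of the l1 nodes one level below the
    frontier, d2 become binary, d3 become ternary, and the rest become leaves.
    """
    m, l1, l2, l3 = signature
    if d2 < 0 or d3 < 0 or d2 + d3 > l1:
        raise ValueError("d2 + d3 cannot exceed l1")
    new_m = m + l1 - d2 - d3    # old leaves plus the l1 nodes that stay childless
    new_l1 = l2                 # nodes two levels down are now one level down
    new_l2 = l3 + 2 * d2        # children of a binary node sit 2 levels lower
    new_l3 = 3 * d3             # children of a ternary node sit 3 levels lower
    return (new_m, new_l1, new_l2, new_l3)


print(grow((0, 2, 0, 0), d2=0, d3=1))  # (1, 0, 0, 3)
```

The printed transition is the one shown earlier in the deck, where Sig (0; 2, 0, 0) grows to Sig (1; 0, 0, 3).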
71 Organized Data
- Premise: People organize data so it is easy to find
- Natural navigation
- Popular items are easily accessible
72 Organized Data
- Observation: Most existing data could be better organized
- Files clutter folders; directory structures lose consistency
- Web pages are buried deep in the website
- Searching takes too much time
74 Organized Data
- Question: How can we automatically improve access to organized information?
- Thesis Goals
- Models for information organization tasks
- Novel deliberation cost
- Computational complexity
- Algorithms and approximations
75 Outline
- Prior Work: Constrained Subtree Selection
- Definitions: k-favorable, constraint-free
- Related work
- Polytime DP algorithm for restricted case
- Other results
- Proposed Future Work
- Dynamic CSS
- Algorithms for open CSS problems
- Category Tree: A decision tree problem