Building Optimal Websites with the Constrained Subtree Selection Problem - PowerPoint PPT Presentation

1
Building Optimal Websites with the Constrained
Subtree Selection Problem
  • Brent Heeringa
  • (joint work with Micah Adler)
  • 09 November 2004

2
A website design problem (for example, a new
kitchen store)
  • Given products, their popularity, and their
    organization
  • How do we create a good website?
  • Navigation is natural
  • Access to information is timely

3
Good website: Natural Navigation
  • Organization is a DAG
  • The transitive closure (TC) of the DAG enumerates all
    viable categorical relationships and introduces shortcuts
  • A subgraph of the TC preserves the logical relationships
    between categories

4
Good website: Timely Access to Info
  • Two obstacles to finding info quickly
  • Time scanning a page for correct link
  • Time descending the DAG
  • Associate a cost with each obstacle
  • Page cost (function of out-degree of node)
  • Path cost (sum of page costs on path)
  • Good access structure
  • Minimize expected path cost
  • Optimal subgraph is always a full tree

[Figure: example tree with a leaf of weight 1/2. Page Cost = # links;
Path Cost = 3 + 2 = 5; Weighted Path Cost = 5/2]
5
Constrained Subtree Selection (CSS)
  • An instance of CSS is a triple (G, γ, w)
  • G is a rooted DAG with n leaves (constraint
    graph)
  • γ is a function of the out-degree of each
    internal node (degree cost)
  • w is a probability distribution over the n
    leaves (weights)
  • A solution is any directed subtree of the
    transitive closure of G which includes the root
    and leaves
  • An optimal solution is one which minimizes the
    expected path cost

[Figure: four leaves, each with weight 1/4; γ(x) = x]
6
Constrained Subtree Selection (CSS)
[Figure: four leaves, each with weight 1/4, with weighted path costs
3(1/4), 5(1/4), 5(1/4), 3(1/4); γ(x) = x; Cost = 4]
7
Constrained Subtree Selection (CSS)
[Figure: four leaves with weights 1/2, 1/6, 1/6, 1/6; γ(x) = x;
Cost = 3 1/2]
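The two example costs can be checked with a minimal sketch of the expected path cost. The nested-list tree encoding, the function name `expected_path_cost`, and the two tree shapes are assumptions made for illustration; the shapes were chosen to be consistent with the path costs (3, 5, 5, 3) and the totals (4 and 3 1/2) quoted on the slides, with degree cost γ(x) = x.

```python
from fractions import Fraction

def expected_path_cost(tree, gamma=lambda d: d):
    """Expected path cost of a tree given as nested lists.

    An internal node is a list of children; a leaf is its weight.
    Each internal node on a root-to-leaf path contributes gamma(out-degree).
    """
    def walk(node, cost_so_far):
        if not isinstance(node, list):          # leaf: weight times path cost
            return node * cost_so_far
        here = cost_so_far + gamma(len(node))   # pay the page cost of this node
        return sum(walk(child, here) for child in node)
    return walk(tree, 0)

q = Fraction(1, 4)
s = Fraction(1, 6)
# Assumed shape: root with 3 links, the middle one leading to a 2-link page.
uniform_tree = [q, [q, q], q]
# Assumed shape: root with 2 links, one leading to a 3-link page.
skewed_tree = [Fraction(1, 2), [s, s, s]]

print(expected_path_cost(uniform_tree))  # expected cost under gamma(x) = x
print(expected_path_cost(skewed_tree))
```

Passing a different `gamma` (for example `lambda d: d * d`) evaluates the same trees under another degree cost.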
8
Constraint-Free Graphs and k-favorability
  • Constraint-Free Graph
  • Every directed, full tree with n leaves is a
    subtree of the TC
  • CSS is no longer constrained by the graph
  • k-favorable degree cost γ
  • Fix γ. There exists k > 1 such that, for any
    constraint-free instance of CSS under γ, an optimal
    tree has maximal out-degree k

9
Linear Degree Cost - γ(x) = x
  • 5 paths w/ cost 5
  • 3 paths w/ cost 5
  • 2 paths w/ cost 4

10
Linear Degree Cost - γ(x) = x
[Figure: a leaf with weight > 1/2]
  • Prefer binary structure when a leaf has at least
    half the mass
  • Prefer ternary structure when mass is
    uniformly distributed
  • CSS with 2-favorable degree costs and constraint-free
    (C.F.) graphs is the Huffman coding problem
  • Examples: quadratic, exponential, ceiling of log
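The connection to Huffman coding can be sketched as follows. If every internal node of an optimal tree is binary, each one costs γ(2), so the expected path cost is γ(2) times the expected leaf depth, which the standard Huffman merge minimizes. The function names below are illustrative assumptions, and the quadratic cost in the usage lines is just one of the examples listed above.

```python
import heapq
from fractions import Fraction

def huffman_expected_depth(weights):
    """Expected leaf depth of a Huffman tree (sum of all merged weights)."""
    heap = list(weights)
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        a = heapq.heappop(heap)       # two lightest subtrees
        b = heapq.heappop(heap)
        total += a + b                # every leaf below the merge gets one level deeper
        heapq.heappush(heap, a + b)
    return total

def binary_css_cost(weights, gamma):
    """Expected path cost when every internal node has out-degree 2."""
    return gamma(2) * huffman_expected_depth(weights)

w = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
depth = huffman_expected_depth(w)
cost = binary_css_cost(w, lambda d: d * d)   # quadratic degree cost
```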

11
Results
  • Complexity: NP-Complete for equal weights and
    many γ
  • Sufficient condition on γ
  • Hardness depends on constraint graph
  • Highlighted Results
  • Theorem: O(n^(γ(k)+k))-time DP algorithm
  • γ is integer-valued, k-favorable and G is
    constraint free
  • γ(x) = x
  • Theorem: poly-time constant-approximation
  • γ ≥ 1 and k-favorable; G has constant out-degree
  • Approximate Hotlink Assignment - Kranakis et
    al.
  • Other results
  • Characterizations of optimal trees for uniform
    probability distributions

12
Related Work
  • Adaptive Websites: Perkowitz & Etzioni
  • Challenge to the AI community
  • Novel views of websites: page synthesis problem
  • Hotlink Assignment: Kranakis, Krizanc, Shende,
    et al.
  • Add 1 hotlink per page to minimize expected
    distance from root to leaves
  • Recently: pages have cost proportional to their
    size
  • Hotlinks don't change page cost
  • Optimal Prefix-Free Codes: Golin & Rote
  • Min code for n words with r symbols where symbol
    a_i has cost c_i
  • Resembles CSS without a constraint graph

13
Exact Cover by 3-Sets
INPUT: (X, C) where X = {x1, ..., xn}, n = 3k, and C = {C1, ..., Cm},
Ci ⊆ X
OUTPUT: C' ⊆ C where |C'| = k and C' covers X
QUESTION: Given k and (X, C), is there a cover of
size k?
Sufficient condition on γ: For every integer
k, there exists an integer s(k) such that
14
(X, C): X = {x1, ..., xn}, n = 3k, and C = {C1, ..., Cm}, Ci ⊆ X
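To make the Exact Cover by 3-Sets question concrete, here is a brute-force checker. It is only an illustration of the problem statement (exponential time, as expected for an NP-complete problem), not part of the reduction used in the talk; the function name and the sample instance are assumptions.

```python
from itertools import combinations

def has_exact_cover(universe, triples):
    """Return True if some k = |X|/3 of the given 3-element sets cover X.

    k sets of size 3 that cover 3k elements are necessarily disjoint,
    so checking the union is enough.
    """
    universe = set(universe)
    if len(universe) % 3 != 0:
        return False
    k = len(universe) // 3
    for choice in combinations(triples, k):
        if set().union(*choice) == universe:
            return True
    return False

X = {1, 2, 3, 4, 5, 6}
C_yes = [{1, 2, 3}, {3, 4, 5}, {4, 5, 6}]
C_no = [{1, 2, 3}, {3, 4, 5}, {2, 5, 6}]
```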
15
Lopsided Trees
  • Recall γ(x) = x, and G is constraint free
  • Node level = path cost
  • Adding an edge increases level
  • Grow lopsided trees level by level

16
Lopsided Trees
17
Lopsided Trees
18
Lopsided Trees
19
Lopsided Trees
  • We know exact cost of tree up to the current
    level i
  • Exact cost of m leaves
  • Remaining n-m leaves must have path-cost at
    least i

20
Lopsided Trees
  • Exact cost of C: 3 · (1/3) = 1
  • Remaining mass up to level 4: (2/3) · 4 = 8/3
  • Total: 1 + 8/3 = 11/3

21
Lopsided Trees
  • Tree cost at Level 5 in terms of Tree cost at
    Level 4
  • Add in the mass of remaining leaves
  • Cost at Level 5
  • No new leaves
  • 11/3 + 2/3 = 13/3
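The level-by-level bookkeeping can be sketched in a few lines. The helper name `truncated_cost` and its argument layout are assumptions; the numbers are the ones on the slides (leaf C with weight 1/3 at level 3, and 2/3 of the mass still unplaced).

```python
from fractions import Fraction

def truncated_cost(placed, remaining_mass, level):
    """Cost of a tree truncated at `level`.

    placed: (weight, level) pairs for leaves already fixed at or above `level`;
    the remaining mass is charged `level`, a lower bound on its path cost.
    """
    exact = sum(w * lvl for w, lvl in placed)
    return exact + remaining_mass * level

placed = [(Fraction(1, 3), 3)]      # leaf C
rest = Fraction(2, 3)               # mass of the leaves not yet placed
cost4 = truncated_cost(placed, rest, 4)
cost5 = truncated_cost(placed, rest, 5)
```

With no new leaves at level 5, the difference `cost5 - cost4` is exactly the remaining mass, which is the incremental rule stated on the slide.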

22
Lopsided Trees
23
Lopsided Trees
24
Lopsided Trees
  • Equality on trees
  • Equal number of leaves at or above frontier
  • Equal number of leaves at each relative level
    below frontier
  • Nodes have outdegree ≤ 3
  • Nodes are at most γ(3) levels below the frontier
  • (m; l1, l2, l3) = signature
  • Example Signature: (2; 3, 2, 0)
  • 2: C and F are leaves
  • 3: G, H, I are 1 level past the frontier
  • 2: J and K are 2 levels past the frontier

25
Inductive Definition
  • Let CSS(m, l1, l2, l3) = min cost tree with sig.
    (m; l1, l2, l3)
  • Can we define CSS(m, l1, l2, l3) in terms of optimal
    substructures?
  • Which trees, when grown by one level, have
    signature (m; l1, l2, l3)?
  • Which signatures (m'; l1', l2', l3') lead to
    (m; l1, l2, l3)?

26
The other direction
Sig (0; 2, 0, 0)
  • Growing a tree only affects frontier
  • Only l1 affects next level
  • Choose # leaves
  • The remaining nodes are internal
  • Choose # degree-2 (d2)
  • Remaining nodes are degree-3 (d3)
  • O(n^2) choices

Sig (1; 0, 0, 3)
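One growth step on signatures can be written as a small function. This is a sketch under the assumptions used in this part of the talk (γ(x) = x, internal nodes of out-degree 2 or 3); the function name `grow` and the tuple layout (m, l1, l2, l3) are illustrative. It reproduces the step from Sig (0; 2, 0, 0) to Sig (1; 0, 0, 3) shown on the slide.

```python
def grow(signature, d2, d3):
    """Advance the frontier by one level.

    signature = (m, l1, l2, l3); of the l1 nodes reaching the frontier,
    d2 become internal with out-degree 2, d3 with out-degree 3,
    and the rest become leaves.
    """
    m, l1, l2, l3 = signature
    if d2 < 0 or d3 < 0 or d2 + d3 > l1:
        raise ValueError("d2 + d3 must be between 0 and l1")
    new_leaves = l1 - d2 - d3
    # Children of a degree-d node sit gamma(d) = d levels below it, so
    # degree-2 children land 2 levels below the new frontier, degree-3 children 3.
    return (m + new_leaves, l2, l3 + 2 * d2, 3 * d3)

step = grow((0, 2, 0, 0), d2=0, d3=1)   # one leaf, one degree-3 node
```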
27
The original question (warning: here be symbols)
  • Which (m'; l1', l2', l3') → (m; l1, l2, l3)?
  • l1' and d2 are sufficient
  • l1' and d2 are both O(n)
  • O(n^2) possibilities for (m'; l1', l2', l3')
  • CSS(m, l1, l2, l3) = min cost tree with sig. (m; l1,
    l2, l3)
  • CSS(m, l1, l2, l3) = min { CSS(m', l1', l2', l3') + c_m' }
    for 1 ≤ d2 ≤ l1' ≤ n
  • (c_m' is the sum of the smallest n-m' weights)
  • CSS(n, 0, 0, 0) = cost of optimal tree
  • Analysis
  • Table size: O(n^4)
  • Each cell takes O(n^2) lookups
  • O(n^6) algorithm
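The recurrence can be exercised with a short sketch. It makes several assumptions that should be read as such: γ(x) = x, a constraint-free graph, and out-degrees 2 and 3 only. Instead of filling a table in a fixed order, it runs Dijkstra's algorithm over signatures, which explores the same states and transitions; this is a swapped-in technique, not the talk's table layout. The function name is hypothetical.

```python
import heapq
from fractions import Fraction

def optimal_linear_css_cost(weights):
    """Minimum expected path cost for gamma(x) = x, out-degrees 2 or 3."""
    w = sorted(weights)                       # ascending
    n = len(w)
    if n < 2:
        raise ValueError("need at least two leaves")
    # tail[m]: mass of the n - m lightest leaves (those not yet placed).
    tail = [sum(w[: n - m]) for m in range(n + 1)]
    # Root at level 0 is internal: its children sit 2 or 3 levels down.
    starts = [(0, 0, 2, 0), (0, 0, 0, 3)]
    heap = [(0, s) for s in starts if sum(s) <= n]
    heapq.heapify(heap)
    done = set()
    while heap:
        cost, sig = heapq.heappop(heap)
        if sig in done:
            continue
        done.add(sig)
        m, l1, l2, l3 = sig
        if m == n:                            # signature (n; 0, 0, 0)
            return cost
        step = cost + tail[m]                 # unplaced mass moves one level deeper
        for d2 in range(l1 + 1):
            for d3 in range(l1 - d2 + 1):
                nxt = (m + l1 - d2 - d3, l2, l3 + 2 * d2, 3 * d3)
                # Every pending node yields at least one leaf, so prune at n.
                if sum(nxt) <= n and nxt not in done:
                    heapq.heappush(heap, (step, nxt))
    raise RuntimeError("no tree found")

uniform = optimal_linear_css_cost([Fraction(1, 4)] * 4)
skewed = optimal_linear_css_cost(
    [Fraction(1, 2), Fraction(1, 6), Fraction(1, 6), Fraction(1, 6)])
```

The heaviest weights are implicitly assigned to the shallowest leaves, which is why only the mass of the lightest unplaced leaves is needed at each step.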

28
Lower Bound on Cost
  • Lemma: H(w)/log(k) is a lower bound on the cost
    of an optimal tree
  • For any k-favorable degree cost γ, with γ ≥ 1
  • G is constraint-free

[Figure: trees T and T' with unit costs]
c(T) ≥ c'(T) ≥ c'(T') ≥ H(w)/log(k)
(Shannon)
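The bound is easy to evaluate numerically. The sketch below assumes logarithms base 2 for both the entropy and log(k) (the ratio is the same for any common base); the function name is illustrative.

```python
import math

def entropy_lower_bound(weights, k):
    """H(w) / log(k): lower bound on the optimal cost when gamma >= 1
    and gamma is k-favorable on a constraint-free graph."""
    h = -sum(w * math.log2(w) for w in weights if w > 0)
    return h / math.log2(k)

bound_binary = entropy_lower_bound([0.25] * 4, 2)
bound_ternary = entropy_lower_bound([0.25] * 4, 3)
```

For four equally likely leaves and k = 3 the bound is below the cost of 4 obtained earlier for γ(x) = x, as the lemma requires.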
29
A Simple Lemma
  • Lemma 2: For any tree with m weighted nodes
    there exists 1 node (splitter) which, when
    removed, divides the tree into subtrees with at
    most half the weight of the original tree.

[Figure: a splitter node; each remaining subtree has weight < 1/2]
Approximation Algorithm
  • Let G be a DAG where out-degree of every node ≤ d
  • Choose a spanning tree T from G
  • Balance-Tree(T)
  • Find a splitter node in T (Lemma 2)
  • Stop if splitter is child of root
  • Disconnect the splitter and reconnect it to the
    root
  • root has degree at most d+1
  • Call Balance-Tree on all subtrees

[Figure: splitter]
Mass of each subtree is at most half of whole
tree
31
Approximation Algorithm
  • Analysis
  • Mass under any node is at most half of mass under its
    grandparent
  • Path length to leaf with weight w_i is at most -2 log(w_i)
  • Theorem
  • O(m)-time O(log(k) γ(d+1))-approx to optimal
    solution
  • For any DAG G with m nodes and out-degree ≤ d
  • For every k-favorable degree cost γ ≥ 1

Upper Bound on Node Cost
Weighted Path Length
35
Proposed Problem 1 (CSS in constraint-free
graphs, equal leaf weights)
  • Question: Polytime algorithm for CSS with
  • Constraint-free graphs
  • Equal leaf weights
  • Increasing degree cost
  • Good News
  • Characterizations for linear and log degree costs
  • Near linear time algorithms for r-ary Varn Codes
    (Huffman codes with r unequal letter costs,
    uniform probability distribution)

36
Varn Codes (infinite lopsided tree)
Symbol Costs: (3,3,3,8,8)
5 Leaves
Note: Not the 5 highest leaves!
37
Varn Codes (infinite lopsided tree)
Symbol Costs: (3,3,3,8,8)
6 Leaves
Note: m internal nodes are the highest m nodes
in the infinite tree
38
Proposed Problem 1 (CSS in constraint-free
graphs, equal leaf weights)
  • Bad News
  • No notion of an infinite lopsided tree in CSS
  • Degree change ⇒ structure change
  • Optimal CSS tree is fairly balanced
  • Property
  • No leaf may appear above the level of any other
    internal node
  • Proof: If it were the case, we could switch
    branches and decrease the cost of the tree
  • Intuition: There is some k which optimizes
    breadth-to-depth tradeoff. The optimal tree
    repeats this structure. Fringe requires some
    computation time.

39
Proposed Problem 2 (Dynamic CSS)
  • CSS often applies to environments which are
    inherently dynamic
  • Web pages change popularity
  • Access patterns change on file systems
  • Question: Given a CSS tree with property P, how
    much time does it take to maintain P after an
    update?
  • P: minimum cost, approximation-ratio of min cost
  • Restrict attention to
  • Integer leaf weights (rational distributions)
  • Unit updates

40
Proposed Problem 2 (Dynamic CSS)
  • Good News: Knuth (and later Vitter) studied
    Dynamic Huffman Codes (DHC)
  • Motivation: One-pass encoding
  • Protocol
  • Both parties maintain optimal tree for first t
    characters
  • Encode and decode the (t+1)st character
  • Update tree
  • Optimality of tree maintained in time
    proportional to encoding

41
DHC Sibling Property
  • A binary tree with n leaves is a Huffman tree
    iff
  • The n leaves have nonnegative weights w1, ..., wn
  • the weight of each internal node is the sum of
    the weights of its children
  • The nodes can be numbered in non-decreasing order
    by weight
  • siblings are numbered consecutively
  • common parent has a higher number

[Figure: Huffman tree with leaves A-F; each node is labeled with its
number and weight]
Numbering corresponds to merging in greedy
algorithm
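The sibling property can be checked mechanically. In the sketch below, nodes are listed in numbering order with a weight and a parent index; this encoding and the function name are assumptions. The example weights (A=2, B=3, C=5, D=5, E=6, F=11, root 32) are a reconstruction from the figure's labels and should be treated as an assumption rather than as the slide's exact tree.

```python
def has_sibling_property(weights, parent):
    """weights[i] and parent[i] describe the node numbered i + 1.

    parent[i] is the 0-based index of the parent, or None for the root.
    """
    n = len(weights)
    # Numbering must be in non-decreasing order of weight.
    if any(weights[i] > weights[i + 1] for i in range(n - 1)):
        return False
    # Nodes (1,2), (3,4), ... must be siblings under a higher-numbered parent
    # whose weight is the sum of theirs.
    for i in range(0, n - 1, 2):
        p = parent[i]
        if p is None or parent[i + 1] != p or p <= i + 1:
            return False
        if weights[p] != weights[i] + weights[i + 1]:
            return False
    return parent[n - 1] is None    # the highest-numbered node is the root

# Nodes 1..11: A, B, C, (A+B), D, E, (C+AB), (D+E), F, (10+11), root
weights = [2, 3, 5, 5, 5, 6, 10, 11, 11, 21, 32]
parent = [3, 3, 6, 6, 7, 7, 9, 9, 10, 10, None]
ok_before = has_sibling_property(weights, parent)

# Incrementing B (and its ancestors) without any exchange breaks the ordering.
bumped = [2, 4, 5, 6, 5, 6, 11, 11, 11, 22, 33]
ok_after = has_sibling_property(bumped, parent)
```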
42
DHC Sibling Property
[Figure: the same Huffman tree after the weight of B is increased]
What happens if we increase B? Node 4 violates
the Sibling Property
43
DHC Sibling Property
[Figure: Huffman tree with leaves A-F, node numbers and weights]
Before updating: exchange the current node with the
highest-numbered node having the same weight
44
DHC Sibling Property
[Figure: Huffman tree with leaves A-F, node numbers and weights]
Before updating: exchange the current node with the
highest-numbered node having the same weight
45
DHC Sibling Property
[Figure: Huffman tree with leaves A-F, renumbered after the exchange]
Different, but still optimal, greedy choice when
merging nodes
46
DHC Sibling Property
[Figure: Huffman tree with leaves A-F, renumbered after the exchange]
Different, but still optimal, greedy choice when
merging nodes
47
DHC Sibling Property
[Figure: Huffman tree with leaves A-F, redrawn after the exchange]
Different, but still optimal, greedy choice when
merging nodes
48
DHC Sibling Property
[Figure: Huffman tree with leaves A-F after the exchange]
Now, safe to increase B, because it can't be
greater than the next highest!
49
DHC Sibling Property
[Figure: Huffman tree with leaves A-F after the weight of B is increased]
Now, safe to increase B, because it can't be
greater than the next highest!
50
Proposed Problem 2 (Dynamic CSS)
  • Good News: DHC generalizes to k-ary alphabets
  • Claim
  • DHC is an O(γ(k))-approximation for CSS
  • γ k-favorable, γ(x) ≥ 1
  • constraint-free graphs

51
Proposed Problem 2 (Dynamic CSS)
  • Bad News: DHC doesn't generalize to Huffman
    codes with unequal letter costs
  • Sibling property corresponds to the greedy algorithm
  • Future
  • Explore DHC for unequal letter costs
  • Maintain approximation ratio in constant degree
    graphs in time proportional to the height
  • (We can do it in linear time already)

52
Proposed Problem 3 (Category Tree - CT)
  • Scenario
  • Large reservoir of songs in iTunes
  • Song is a vector of categorical values
  • Common to search all the songs for the right one
  • Question: Can we organize the songs by
    categories so that the average search time is
    minimized?

53
Proposed Problem 3 (Category Tree - CT)
  • Category Tree: CT(γ, C, S)
  • γ is the degree cost
  • C = (d1, ..., dm) are the m category sizes
  • S is a set of objects drawn from C
  • Solution: Rooted, oriented tree
  • Internal nodes are categories
  • Edges are appropriate categorical values
  • Leaves are objects
  • Optimal solution
  • Minimize expected path cost
  • Path cost is defined as in CSS

Optimal solution corresponds to an adaptive
ordering of the categories
54
Proposed Problem 3 (Constrained Category Tree -
CCT)
  • Constrained Category Tree: CCT(γ, C, S)
  • γ is the degree cost
  • C = (d1, ..., dm) are the m category sizes
  • S is a set of objects drawn from C
  • Solution: Rooted, oriented tree
  • Internal nodes are categories (and internal nodes
    at the same depth have the same category)
  • Edges are appropriate categorical values
  • Leaves are objects
  • Optimal solution
  • Minimize expected path cost
  • Path cost is defined as in CSS

Optimal solution corresponds to a fixed ordering
of the categories
55
Proposed Problem 3 (Category Tree - CT)
  • CT and CCT are classical Decision Tree problems

Decision Tree (DT). Input: m binary tests
T = (T1, ..., Tm) and n objects O = (O1, ..., On). Output: Binary
tree where internal nodes are Ti and leaves are
Oi. Measure: Total external path length
  • CT and CCT are NP-Complete
  • Reduction from Exact Cover by 3-Sets (XC3)
  • Resembles hardness proof for Decision Tree

56
Proposed Problem 3 (Category Tree - CT)
Decision Tree Inference (DTI). Input: m
examples, T/F-labeled binary strings from {0,1}^n.
Output: Binary tree, consistent with the examples, where
internal nodes are string positions and leaves are TRUE
or FALSE. Measure: Number
of leaves (i.e. size of tree)
  • CT and CCT are not instances of DTI
  • DT doesn't easily reduce to DTI
  • Most complexity results (lower bounds on
    approximations) are for DTI only!

57
Timetable
  • Solve some subset of open problems
  • 1-2 academic years

58
Open Problems
  • Theorem: There is an for any instance (G, γ, w) of
    CSS where G is constraint free, γ is
    k-favorable, maps the positive integers to the
    positive integers and is non-decreasing

NO
  • Proof
  • c(T) ≥ c'(T) ≥ c'(T') ≥ H(w)/log(k)
  • T is optimal tree for CSS cost c
  • T' is optimal tree for OPC cost c' for k symbols
    each with weight 1 (i.e. γ(x) = 1)
  • H is entropy

59
Signatures as Representation
  • Different lopsided trees share common
    substructure when truncated
  • Level-i-Truncations: Include node iff parent's level is
    at most i
  • Level-i-Signatures: (m; l1, ..., l_γ(k))
  • m is the # of leaves at level ≤ i
  • lj is the # of nodes at level i+j
  • Cost of Level-i-Truncation
  • Exact cost for m leaves
  • Cost up to the truncation for the remaining n-m
    leaves.

60
The Dynamic Programming Table
  • Signatures ↔ Table entries
  • MIN[m; l1, ..., l_γ(k)] gives min-cost of all
    truncated trees with signature (m; l1, ..., l_γ(k))
  • O(n^(γ(k)+1)) entries
  • level-i truncation is parent of O(n^(k-1))
    level-(i+1) truncations
  • level-i sig is parent of
  • O(n^(k-1)) level-(i+1) sigs
  • Choose how many nodes at next level will be
    internal
  • Among those, choose how many will be degree 2,
    degree 3, ..., degree k: O(n^(k-1)) choices
  • Consistent ordering of entries
  • O(n^(γ(k)+k)) algorithm; MIN[n; 0, ..., 0] contains
    minimum cost
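
The running time follows by multiplying the number of table entries by the number of choices considered per entry. As a worked check, assuming the linear case γ(x) = x with k = 3 used earlier in the talk, the same product gives the O(n^6) bound:

```latex
\underbrace{O\!\left(n^{\gamma(k)+1}\right)}_{\text{table entries}}
\cdot
\underbrace{O\!\left(n^{k-1}\right)}_{\text{choices per entry}}
= O\!\left(n^{\gamma(k)+k}\right),
\qquad
\gamma(x)=x,\ k=3:\quad
O\!\left(n^{4}\right)\cdot O\!\left(n^{2}\right) = O\!\left(n^{6}\right).
```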

61
  • Set of products
  • The desired information
  • e.g., chef paring knives
  • Popularity of products
  • Weights
  • Hierarchical organization of products into
    categories
  • Single, global category (the root)
  • Products are endpoints (leaves)
  • General to specific trajectory

62
  • Adaptive Websites: Perkowitz & Etzioni
  • Page synthesis (novel view) with clustering and
    concept learning using access logs
  • Efficiently find topic of interest (effort)
  • Hotlink Assignment: Kranakis, Krizanc, Shende,
    et al.
  • Add k hotlinks per page to minimize expected
    distance from root to leaves
  • Recently: pages have fixed cost proportional to
    their size
  • Hotlinks don't change path-cost
  • Optimal Prefix-Free Codes: Golin & Rote
  • Min code for n words with r symbols where symbol
    a_i has cost c_i
  • Resembles CSS without a constraint graph

63
Lopsided Trees
  • (m; l1, ..., l_γ(k)) ↔ MIN[m, l1, ..., l_γ(k)]
  • n leaves so at most O(n^(γ(k)+1)) entries
  • Entry stores minimum cost of tree bearing that
    signature
  • Total ordering on signatures, consistent with the
    growing process
  • O(n^(k-1)) choices
  • O(n^(γ(k)+k)) algorithm

64
Lopsided Trees
  • Tree cost at Level 5 in terms of Tree cost at
    Level 4
  • Cost at Level 5: 11/3 + 2/3 = 13/3
  • Cost at Level 6: 13/3 + 1/2 = 29/6

65
The original question (warning: here be symbols)
  • Which (m'; l1', l2', l3') → (m; l1, l2, l3)?

66
The original question (warning: here be symbols)
  • Which (m'; l1', l2', l3') → (m; l1, l2, l3)?
  • Suppose we know
  • l1' (the # of nodes one level below the frontier)
  • d2 (the # of l1' which are degree-2 nodes in
    (m; l1, l2, l3))
  • Let's determine the values of the remaining
    variables

[Figure: the l1' nodes one level below the frontier, d2 of them with
degree 2]
67
The original question (warning: here be symbols)
  • Which (m'; l1', l2', l3') → (m; l1, l2, l3)?

m = m' + l1' - d2 - d3
  • m: the new number of leaves
  • m': the old number of leaves
  • l1': nodes at one level below the frontier
  • d2: internal nodes of degree 2
  • d3: internal nodes of degree 3
68
The original question (warning: here be symbols)
  • Which (m'; l1', l2', l3') → (m; l1, l2, l3)?

m = m' + l1' - d2 - l3/3
  • m: the new number of leaves
  • m': the old number of leaves
  • l1': nodes at one level below the frontier
  • d2: internal nodes of degree 2
  • l3/3: internal nodes of degree 3
69
The original question (warning: here be symbols)
  • Which (m'; l1', l2', l3') → (m; l1, l2, l3)?

l2' = l1
  • l2': the old number of nodes at 2 levels below the
    frontier
  • l1: new nodes one level below the frontier
70
The original question (warning: here be symbols)
  • Which (m'; l1', l2', l3') → (m; l1, l2, l3)?

l2 = l3' + 2 d2
  • l2: the new number of nodes 2 levels below the
    frontier
  • d2 nodes are binary so they contribute 2 d2 to the
    frontier
71
Organized Data
  • Premise: People organize data so it is easy to
    find
  • Natural navigation
  • Popular items are easily accessible

72
Organized Data
  • Observation: Most existing data could be better
    organized
  • Files clutter folders; directory structures lose
    consistency
  • Web pages are buried deep in the website
  • Searching takes too much time

73
Organized Data
  • Question: How can we automatically improve access
    to organized information?

74
Organized Data
  • Question: How can we automatically improve access
    to organized information?
  • Thesis Goals
  • Models for information organization tasks
  • Novel deliberation cost
  • Computational complexity
  • Algorithms and approximations

75
Outline
  • Prior Work: Constrained Subtree Selection
  • Definitions: k-favorable, constraint-free
  • Related work
  • Polytime DP algorithm for restricted case
  • Other results
  • Proposed Future Work
  • Dynamic CSS
  • Algorithms for open CSS problems
  • Category Tree: A decision tree problem