Title: Agnostically Learning Decision Trees
Slide 1: Agnostically Learning Decision Trees
Parikshit Gopalan (MSR Silicon Valley), Adam Tauman Kalai (MSR New England), Adam R. Klivans (UT Austin)
[Figure: a decision tree querying X1, with 0/1 leaves]
Slides 2-4: Computational Learning
f: {0,1}^n → {0,1}
Examples: (x, f(x))
Learning: Predict f from examples.
Slide 5: Valiant's Model
f: {0,1}^n → {0,1}
Examples: (x, f(x))
Concept class: Halfspaces.
Assumption: f comes from a nice concept class.
Slide 6: Valiant's Model
f: {0,1}^n → {0,1}
Examples: (x, f(x))
Concept class: Decision Trees.
Assumption: f comes from a nice concept class.
Slides 7-8: The Agnostic Model [Kearns-Schapire-Sellie '94]
f: {0,1}^n → {0,1}
Examples: (x, f(x))
Concept class: Decision Trees.
No assumptions about f. The learner should do as well as the best decision tree.
Slide 9: Agnostic Model ≈ Noisy Learning
f: {0,1}^n → {0,1}
- Concept ↔ Message
- Truth table ↔ Encoding
- Function f ↔ Received word
- Coding: recover the message.
- Learning: predict f.
Slide 10: Uniform-Distribution Learning of Decision Trees
- Noiseless setting:
  - Without queries: n^O(log n) time [Ehrenfeucht-Haussler '89].
  - With queries: poly(n) time [Kushilevitz-Mansour '91].
- Agnostic setting: polynomial time, uses queries [G.-Kalai-Klivans '08].
  - Via reconstruction for sparse real polynomials in the l1 norm.
Slide 11: The Fourier Transform Method
- Powerful tool for uniform-distribution learning.
- Introduced by Linial-Mansour-Nisan.
- Small-depth circuits [Linial-Mansour-Nisan '89]
- DNFs [Jackson '95]
- Decision trees [Kushilevitz-Mansour '94, O'Donnell-Servedio '06, G.-Kalai-Klivans '08]
- Halfspaces, intersections [Klivans-O'Donnell-Servedio '03, Kalai-Klivans-Mansour-Servedio '05]
- Juntas [Mossel-O'Donnell-Servedio '03]
- Parities [Feldman-G.-Khot-Ponnuswami '06]
Slides 12-13: The Fourier Polynomial
- Let f: {-1,1}^n → {-1,1}.
- Write f as a polynomial (numerical check below):
  - AND = ½ + ½X1 + ½X2 - ½X1X2
  - Parity = X1X2
- Parity of S ⊆ [n]: χ_S(x) = ∏_{i∈S} Xi
- Write f(x) = Σ_S c(S) χ_S(x), with Σ_S c(S)² = 1 (Parseval).
- In other words: expand the truth table of f over the basis of parities instead of the standard basis.
- c(S)² = the Fourier weight of S.
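To make the expansion concrete, here is a minimal Python sketch (mine, not the talk's) that computes every coefficient c(S) = E_x[f(x)·χ_S(x)] by brute force over {-1,1}^n and checks the AND example and Parseval's identity:

from itertools import product

def fourier_coefficients(f, n):
    """All c(S) = E_x[f(x) * chi_S(x)], by brute force over {-1,1}^n."""
    xs = list(product([-1, 1], repeat=n))
    coeffs = {}
    for mask in range(2 ** n):                     # mask encodes the subset S
        S = tuple(i for i in range(n) if mask >> i & 1)
        total = 0
        for x in xs:
            chi = 1
            for i in S:                            # chi_S(x) = prod_{i in S} x_i
                chi *= x[i]
            total += f(x) * chi
        coeffs[S] = total / len(xs)
    return coeffs

# AND under the convention used on the slide: output -1 iff X1 = X2 = -1.
AND = lambda x: -1 if x[0] == -1 and x[1] == -1 else 1
c = fourier_coefficients(AND, 2)
print(c)                                # {(): 0.5, (0,): 0.5, (1,): 0.5, (0, 1): -0.5}
print(sum(v * v for v in c.values()))   # Parseval: 1.0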
Slide 14: Low-Degree Functions
- Low-degree functions: most of the Fourier weight lies on small subsets.
  - Examples: halfspaces, small-depth circuits.
- The low-degree algorithm [Linial-Mansour-Nisan] finds the low-degree Fourier coefficients.
- Least-squares regression: find a low-degree P minimizing E_x[(P(x) - f(x))²].
Slides 15-16: Sparse Functions and Sparse l2 Regression
- Sparse functions: most of the Fourier weight lies on a few subsets.
  - Example: decision trees with t leaves ⇒ O(t) subsets.
- Sparse algorithm [Kushilevitz-Mansour '91].
- Sparse l2 regression: find a t-sparse P minimizing E_x[(P(x) - f(x))²] (sketch below).
- Finding the large coefficients = Hadamard decoding [Kushilevitz-Mansour '91, Goldreich-Levin '89].
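A minimal sketch of what the sparse l2 objective asks for, assuming we already hold all the coefficients (the whole point of KM/Goldreich-Levin is to avoid that, using only query access): by Parseval, the l2-best t-sparse approximation simply keeps the t largest-magnitude coefficients. Reusing fourier_coefficients from the sketch above:

def best_t_sparse(coeffs, t):
    """Keep the t coefficients of largest magnitude, zero out the rest.
    The dropped squared coefficients are exactly the l2 error, so this is optimal."""
    top = sorted(coeffs.items(), key=lambda kv: abs(kv[1]), reverse=True)[:t]
    return dict(top)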
Slides 17-20: Agnostic Learning via l2 Regression? Via l1 Regression?
[Figure: target f vs. the best tree]
- l2 regression:
  - Loss: (P(x) - f(x))²
  - Pay 1 for indecision.
  - Pay 4 for a mistake.
- l1 regression [KKMS '05]:
  - Loss: |P(x) - f(x)|
  - Pay 1 for indecision.
  - Pay 2 for a mistake.
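A worked check of the payments above, for f(x) ∈ {-1, 1}:

$$P(x) = 0 \ (\text{indecision}):\quad (P(x)-f(x))^2 = 1, \qquad |P(x)-f(x)| = 1,$$
$$P(x) = -f(x) \ (\text{mistake}):\quad (P(x)-f(x))^2 = 4, \qquad |P(x)-f(x)| = 2.$$

So under the l1 loss a mistake costs only twice an abstention, versus four times under l2.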
Slide 21: Agnostic Learning via l1 Regression
[Figure: target f vs. the best tree]
Theorem [KKMS '05]: l1 regression always gives a good predictor.
l1 regression for low-degree polynomials can be done via linear programming.
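A sketch of the linear program behind the last line (my formulation of standard l1 regression, not taken from the talk): over a sample x_1, ..., x_m and coefficients c(S) of the degree-d parities, introduce slack variables e_j for the absolute values:

$$\min_{c,\,e}\ \frac{1}{m}\sum_{j=1}^{m} e_j
\quad\text{s.t.}\quad
e_j \ \ge\ \pm\Big(\textstyle\sum_{|S|\le d} c(S)\,\chi_S(x_j) - f(x_j)\Big)\ \ \text{for all } j.$$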
Slide 22: Agnostically Learning Decision Trees
Sparse l1 regression: find a t-sparse polynomial P minimizing E_x[|P(x) - f(x)|].
Why is this harder?
- l2 is basis independent; l1 is not.
- We don't know the support of P.
[G.-Kalai-Klivans]: a polynomial-time algorithm for sparse l1 regression.
Slide 23: The Gradient-Projection Method
For P(x) = Σ_S c(S) χ_S(x) and Q(x) = Σ_S d(S) χ_S(x):
  L1(P, Q) = Σ_S |c(S) - d(S)|,   L2(P, Q) = (Σ_S (c(S) - d(S))²)^(1/2).
Variables: the coefficients c(S). Constraint: Σ_S |c(S)| ≤ t. Minimize E_x[|P(x) - f(x)|].
Slides 24-26: The Gradient-Projection Method
Take a gradient step on the objective, then project back onto the feasible set.
Variables: the coefficients c(S). Constraint: Σ_S |c(S)| ≤ t. Minimize E_x[|P(x) - f(x)|].
Slide 27: The Gradient
[Figure: f(x) vs. P(x)]
Increase P(x) where it is too low; decrease P(x) where it is too high:
- g(x) = sgn(f(x) - P(x))
- P(x) ← P(x) + γ·g(x)
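A minimal Python sketch of this update (mine), acting on the truth tables of f and P; this is the idealized, exponential-size version of the step, before any Fourier sparsity enters:

import numpy as np

def gradient_step(P_vals, f_vals, gamma):
    """One step of the slide-27 update: move P(x) toward f(x) by gamma
    in the direction g(x) = sgn(f(x) - P(x))."""
    g = np.sign(f_vals - P_vals)
    return P_vals + gamma * g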
Slides 28-29: The Gradient-Projection Method
Repeat: gradient step, then projection.
Variables: the coefficients c(S). Constraint: Σ_S |c(S)| ≤ t. Minimize E_x[|P(x) - f(x)|].
Slides 30-33: Projection onto the L1 Ball
Currently Σ_S |c(S)| > t; we want Σ_S |c(S)| ≤ t. Pick a cutoff:
- Coefficients below the cutoff: set to 0.
- Coefficients above the cutoff: subtract the cutoff.
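A minimal sketch of the cutoff rule above: Euclidean projection of a coefficient vector onto the L1 ball of radius t by soft-thresholding, with the cutoff found by sorting (a standard routine, not the talk's code):

import numpy as np

def project_onto_l1_ball(c, t):
    """Zero out coefficients below a cutoff theta and shrink the rest by theta,
    where theta is chosen so the result has L1 norm exactly t."""
    a = np.abs(c)
    if a.sum() <= t:
        return c.copy()
    u = np.sort(a)[::-1]                                     # magnitudes, descending
    cumsum = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(u) + 1) > (cumsum - t))[0][-1]
    theta = (cumsum[rho] - t) / (rho + 1.0)                  # the cutoff
    return np.sign(c) * np.maximum(a - theta, 0.0)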
Slide 34: Analysis of Gradient-Projection [Zinkevich '03]
- Progress measure: squared L2 distance from the optimum P*.
- Key equation:
    ‖P_t - P*‖² - ‖P_{t+1} - P*‖² ≥ 2γ·(L(P_t) - L(P*)) - γ²
  The left side is the progress made in this step; L(P_t) - L(P*) is how suboptimal the current solution is.
- Within ε of optimal in 1/ε² iterations (unpacked below).
- A good L2 approximation to P_t suffices.
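One way to unpack the 1/ε² claim from the key equation (my rearrangement, not spelled out in the talk): summing over T steps telescopes the left side, so

$$\min_{s\le T}\ L(P_s) - L(P^*) \ \le\ \frac{1}{T}\sum_{s=1}^{T}\big(L(P_s)-L(P^*)\big)
\ \le\ \frac{\|P_1-P^*\|^2}{2\gamma T} + \frac{\gamma}{2},$$

which is O(ε) once γ ≈ ε and T ≈ ‖P_1 - P*‖²/ε², i.e. on the order of 1/ε² iterations (up to the dependence on t).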
Slide 35: [Figure: a gradient step on f(x) and P(x), followed by projection]
Slide 36: The Gradient
[Figure: f(x) vs. P(x)]
- Compute a sparse approximation g' = KM(g).
- Is g' a good L2 approximation to g?
- No: initially g = f, and L2(g, g') can be as large as 1.
Slides 37-38: Sparse l1 Regression
Use the approximate gradient g' = KM(g); the projection step compensates for the approximation error.
Variables: the coefficients c(S). Constraint: Σ_S |c(S)| ≤ t. Minimize E_x[|P(x) - f(x)|].
Slide 39: KM as l2 Approximation
The KM algorithm.
Input: g: {-1,1}^n → {-1,1} and a sparsity t.
Output: a t-sparse polynomial g' minimizing E_x[(g(x) - g'(x))²].
Run time: poly(n, t).
Slide 40: KM as L1 Approximation
The KM algorithm.
Input: a Boolean function g = Σ_S c(S) χ_S(x) and an error bound ε.
Output: an approximation g' = Σ_S c'(S) χ_S(x) such that |c(S) - c'(S)| ≤ ε for all S ⊆ [n].
Run time: poly(n, 1/ε).
Slides 41-42: KM as L1 Approximation
- Only 1/ε² coefficients can be larger than ε (since Σ_S c(S)² = 1).
- Identify the coefficients larger than ε.
- Estimate them via sampling; set the rest to 0.
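A minimal sketch of the "estimate via sampling" step, assuming query access to g (function name and signature are mine): since c(S) = E_x[g(x)·χ_S(x)] is an average of ±1 values, O(log(1/δ)/ε²) uniform samples estimate it to within ε with probability 1 - δ.

import numpy as np

def estimate_coefficient(g, S, n, num_samples, rng=None):
    """Empirical estimate of c(S) = E_x[g(x) * chi_S(x)] from random queries to g."""
    rng = rng or np.random.default_rng()
    xs = rng.choice([-1, 1], size=(num_samples, n))    # uniform points of {-1,1}^n
    chi = np.prod(xs[:, list(S)], axis=1)              # chi_S(x); equals 1 when S is empty
    gs = np.array([g(x) for x in xs])                  # g evaluated on the samples
    return float(np.mean(gs * chi))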
Slides 43-46: Projection Preserves L1 Distance
- Both cutoff lines stop within ε of each other (otherwise the blue curve would dominate the red one), so after projection the two coefficient vectors differ by at most 2ε in each coefficient.
- Projecting onto the L1 ball does not increase L1 distance.
Slide 47: Sparse l1 Regression
- Each coefficient of P and P' differs by at most 2ε.
- L1(P, P') ≤ 2t, since both lie in the L1 ball of radius t.
- Hence L2(P, P')² ≤ 2ε · 2t = 4εt (one-line check below).
- Can take ε = 1/t².
Variables: the coefficients c(S). Constraint: Σ_S |c(S)| ≤ t. Minimize E_x[|P(x) - f(x)|].
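A one-line check of the 4εt bound above, writing c and c' for the coefficients of P and P':

$$L_2(P,P')^2=\sum_S\big(c(S)-c'(S)\big)^2\ \le\ \Big(\max_S|c(S)-c'(S)|\Big)\cdot\sum_S|c(S)-c'(S)|\ \le\ 2\varepsilon\cdot 2t\ =\ 4\varepsilon t.$$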
Slide 48: Agnostically Learning Decision Trees
Sparse l1 regression: find a sparse polynomial P minimizing E_x[|P(x) - f(x)|].
[G.-Kalai-Klivans '08]:
- Can get within ε of the optimum in poly(t, 1/ε) iterations.
- Gives an algorithm for sparse l1 regression.
- First polynomial-time algorithm for agnostically learning sparse polynomials.
Slide 49: l1 Regression from l2 Regression
Let f: D → {-1,1} and let B be an orthonormal basis.
Sparse l2 regression: find a t-sparse polynomial P minimizing E_x[(P(x) - f(x))²].
Sparse l1 regression: find a t-sparse polynomial P minimizing E_x[|P(x) - f(x)|].
[G.-Kalai-Klivans '08]: Given a solver for sparse l2 regression, one can solve sparse l1 regression.
Slide 50: Agnostically Learning DNFs?
- Problem: Can we agnostically learn DNFs in polynomial time (uniform distribution, with queries)?
- Noiseless setting: Jackson's Harmonic Sieve.
- Implies a weak learner for depth-3 circuits.
- Beyond current Fourier techniques.
Thank You!