Title: Harmonic Analysis in Learning Theory
1. Harmonic Analysis in Learning Theory
- Jeff Jackson
- Duquesne University
2. Themes
- Harmonic analysis is central to learning-theoretic results in a wide variety of models
- Results are generally the strongest known for learning with respect to the uniform distribution
- Work on learning problems has led to some new harmonic results
  - Spectral properties of Boolean function classes
  - Algorithms for approximating Boolean functions
3. Uniform Learning Model
- Boolean function class F (e.g., DNF)
- Target function f: {0,1}^n → {0,1}
- Example oracle EX(f) supplies uniform random examples <x, f(x)>
- Accuracy ε > 0
- Learning algorithm A outputs a hypothesis h: {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] < ε
4. Circuit Classes
- Constant-depth AND/OR circuits (AC0 without the polynomial-size restriction; call this class CDC)
- DNF: a depth-2 circuit with an OR at the root
[Figure: a constant-depth circuit with d alternating levels of ∧ and ∨ gates over variables v1, ..., vn; negations allowed]
5. Decision Trees
[Figure: a decision tree with internal nodes labeled v3, v2, v1, v4 and leaves labeled 0 and 1]
6. Decision Trees
[Figure: evaluating the tree on x = 11001; the root queries x3, and x3 = 0]
7. Decision Trees
[Figure: continuing the evaluation of x = 11001; the next node queries x1, and x1 = 1]
8. Decision Trees
[Figure: the evaluation of x = 11001 reaches a leaf labeled 1, so f(x) = 1 (a small evaluation sketch follows below)]
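The figures above walk a single input through the tree. Below is a minimal evaluation sketch in Python, using a hypothetical tree that is consistent with the steps shown; the slides do not fully specify the tree, so the structure and names here are illustrative only.

```python
# Hypothetical decision tree consistent with the evaluation above: the root
# queries x3; on x3 = 0 the next node queries x1, whose 1-branch is a 1-leaf.
# Internal nodes are (variable index, zero-subtree, one-subtree); leaves are 0 or 1.
tree = (3, (1, 0, 1),            # x3 = 0: query x1, leaves 0 / 1
           (2, 0, (4, 0, 1)))    # x3 = 1: query x2, then possibly x4

def evaluate(tree, x):
    """Follow the tree on input x (variables are 1-indexed)."""
    while not isinstance(tree, int):
        var, zero_branch, one_branch = tree
        tree = one_branch if x[var - 1] == 1 else zero_branch
    return tree

x = [1, 1, 0, 0, 1]          # x = 11001
print(evaluate(tree, x))     # -> 1, matching the figure
```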
9. Function Size
- Each function representation has a natural size measure
  - CDC, DNF: number of gates
  - DT: number of leaves
- The size s_F(f) of f with respect to class F is the size of the smallest representation of f within F
- For all Boolean f, s_CDC(f) ≤ s_DNF(f) ≤ s_DT(f)
10. Efficient Uniform Learning Model
- Boolean function class F (e.g., DNF)
- Target function f: {0,1}^n → {0,1}
- Example oracle EX(f) supplies uniform random examples <x, f(x)>
- Accuracy ε > 0
- Learning algorithm A, now required to run in time poly(n, s_F, 1/ε)
- Output: hypothesis h: {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] < ε
11. Harmonic-Based Uniform Learning
- [LMN]: constant-depth circuits are quasi-efficiently (n^polylog(s/ε)-time) uniform learnable
- [BT]: monotone Boolean functions are uniform learnable in time roughly 2^(√n·log n)
  - Monotone: for all x and i, f(x|x_i←0) ≤ f(x|x_i←1)
  - Also exponential in 1/ε (so assumes ε is constant)
  - But independent of any size measure
12. Notation
- Assume f: {0,1}^n → {-1,+1}
- For all a ∈ {0,1}^n, χ_a(x) = (-1)^(a·x)
- For all a ∈ {0,1}^n, the Fourier coefficient of f at a is f̂(a) = E_{x~U}[f(x)·χ_a(x)] (a small numerical sketch follows below)
- Sometimes write, e.g., f̂(1) for f̂(10...0)
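A small numerical sketch of these definitions; function names and the example target are illustrative. The sampling estimate anticipates the example-based algorithms on later slides. Note that under this sign convention a monotone function correlates with -χ_i rather than χ_i, which is why the single-variable coefficient below comes out negative.

```python
import random

def chi(a, x):
    """Parity character chi_a(x) = (-1)^(a . x) for 0/1 vectors a and x."""
    return -1 if sum(ai & xi for ai, xi in zip(a, x)) % 2 else 1

def fourier_coefficient(f, a, n):
    """Exact f_hat(a) = E_{x~U}[f(x) * chi_a(x)], by enumerating {0,1}^n."""
    total = 0
    for i in range(2 ** n):
        x = [(i >> j) & 1 for j in range(n)]
        total += f(x) * chi(a, x)
    return total / 2 ** n

def estimate_coefficient(f, a, n, m=20000):
    """Sample-based estimate of f_hat(a) from m uniform random examples."""
    s = 0
    for _ in range(m):
        x = [random.getrandbits(1) for _ in range(n)]
        s += f(x) * chi(a, x)
    return s / m

# Example: majority of 3 bits, with range {-1,+1}.
maj3 = lambda x: 1 if sum(x) >= 2 else -1
print(fourier_coefficient(maj3, [1, 0, 0], 3))   # exactly -0.5
print(estimate_coefficient(maj3, [1, 0, 0], 3))  # close to -0.5
```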
13. Fourier Properties of Classes
- [LMN]: if f is a constant-depth circuit of depth d and S = {a : |a| < log^d(s/ε)} (|a| = number of 1s in a), then Σ_{a∉S} f̂²(a) < ε
- [BT]: if f is a monotone Boolean function and S = {a : |a| < √n/ε}, then Σ_{a∉S} f̂²(a) < ε
14. Spectral Properties
15. Proof Techniques
- [LMN]: Håstad's Switching Lemma + harmonic analysis
- [BT]: based on [KKL]
  - Define the average sensitivity AS(f) = n · Pr_{x,i}[f(x|x_i←0) ≠ f(x|x_i←1)] (a Monte Carlo sketch follows below)
  - If S = {a : |a| < AS(f)/ε} then Σ_{a∉S} f̂²(a) < ε
  - For monotone f, harmonic analysis + Cauchy-Schwarz shows AS(f) ≤ √n
  - Note: this is tight for MAJ
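A Monte Carlo sketch of the average-sensitivity definition above; the sample size and the MAJ example are illustrative choices.

```python
import random

def average_sensitivity(f, n, samples=20000):
    """Estimate AS(f) = n * Pr_{x,i}[ f(x with x_i=0) != f(x with x_i=1) ]."""
    flips = 0
    for _ in range(samples):
        x = [random.getrandbits(1) for _ in range(n)]
        i = random.randrange(n)
        x0, x1 = list(x), list(x)
        x0[i], x1[i] = 0, 1
        flips += f(x0) != f(x1)
    return n * flips / samples

# MAJ on n (odd) bits: AS(MAJ) grows like sqrt(n), matching the tightness remark above.
n = 101
maj = lambda x: 1 if sum(x) > n // 2 else -1
print(average_sensitivity(maj, n))   # roughly 0.8 * sqrt(n), i.e. about 8 here
```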
16. Function Approximation
- For all Boolean f, f(x) = Σ_{a∈{0,1}^n} f̂(a)·χ_a(x)
- For S ⊆ {0,1}^n, define f_S(x) = Σ_{a∈S} f̂(a)·χ_a(x)
- [LMN]: Pr_x[f(x) ≠ sign(f_S(x))] ≤ Σ_{a∉S} f̂²(a)
17. The Fourier Learning Algorithm
- Given ε (and perhaps s, d)
- Determine k such that for S = {a : |a| < k}, Σ_{a∉S} f̂²(a) < ε
- Draw a sufficiently large sample of examples <x, f(x)> to closely estimate f̂(a) for all a ∈ S
  - Chernoff bounds: a sample of size n^k/ε suffices
- Output h = sign(Σ_{a∈S} f̂(a)·χ_a), using the estimated coefficients
- Run time ≈ n^(2k)/ε (a minimal sketch of the algorithm follows below)
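A minimal sketch of this low-degree algorithm for small n and k; the sample size, names, and toy target are illustrative choices rather than the original implementation.

```python
import itertools
import random

def chi(a, x):
    """chi_a(x) = (-1)^(a . x) for 0/1 tuples a and x."""
    return -1 if sum(ai & xi for ai, xi in zip(a, x)) % 2 else 1

def low_degree_learn(examples, n, k):
    """Estimate every Fourier coefficient of degree <= k from uniform examples,
    then return the hypothesis h(x) = sign(sum_{|a| <= k} est(a) * chi_a(x))."""
    m = len(examples)
    coeffs = {}
    for size in range(k + 1):
        for idx in itertools.combinations(range(n), size):
            a = tuple(1 if i in idx else 0 for i in range(n))
            coeffs[a] = sum(y * chi(a, x) for x, y in examples) / m
    def h(x):
        return 1 if sum(c * chi(a, x) for a, c in coeffs.items()) >= 0 else -1
    return h

# Toy run: learn MAJ on 5 bits from uniform examples, using degree bound k = 3.
n, k, m = 5, 3, 4000
target = lambda x: 1 if sum(x) >= 3 else -1
examples = []
for _ in range(m):
    x = tuple(random.getrandbits(1) for _ in range(n))
    examples.append((x, target(x)))
h = low_degree_learn(examples, n, k)
errs = sum(h(x) != target(x) for x in itertools.product((0, 1), repeat=n))
print(errs / 2 ** n)   # small, since MAJ_5 has little Fourier weight above degree 3
```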
18. Halfspaces
- [KOS]: halfspaces are efficiently uniform learnable (given that ε is constant)
  - Halfspace: there is a w ∈ R^(n+1) s.t. f(x) = sign(w·(x∘1))
  - If S = {a : |a| < (21/ε)²} then Σ_{a∉S} f̂²(a) < ε
  - Apply the LMN algorithm
- A similar result applies to an arbitrary function applied to a constant number of halfspaces
  - Intersection of halfspaces is a key learning problem
19. Halfspace Techniques
- Noise sensitivity [O] (cf. [BKS], [BJTa])
  - The noise sensitivity of f at η is the probability that corrupting each bit of x independently with probability η changes f(x)
  - NS_η(f) = ½(1 - Σ_a (1-2η)^|a| · f̂²(a)) (a numerical check follows below)
- [KOS]:
  - If S = {a : |a| < 1/η} then Σ_{a∉S} f̂²(a) < 3·NS_η(f)
  - If f is a halfspace then NS_ε(f) < 9√ε
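A small numerical check of the noise-sensitivity identity above, by direct enumeration for a small n; the example function is an arbitrary small DNF, and all names are illustrative.

```python
import itertools

def chi(a, x):
    return -1 if sum(ai & xi for ai, xi in zip(a, x)) % 2 else 1

def fourier(f, n):
    """All Fourier coefficients of f, by direct enumeration of {0,1}^n."""
    pts = list(itertools.product((0, 1), repeat=n))
    return {a: sum(f(x) * chi(a, x) for x in pts) / 2 ** n for a in pts}

def ns_direct(f, n, eta):
    """NS_eta(f) from the definition: average over x and noise patterns r."""
    pts = list(itertools.product((0, 1), repeat=n))
    total = 0.0
    for x in pts:
        for r in pts:
            p = 1.0
            for bit in r:
                p *= eta if bit else (1 - eta)
            y = tuple(xi ^ ri for xi, ri in zip(x, r))
            total += p * (f(x) != f(y))
    return total / len(pts)

def ns_from_spectrum(f, n, eta):
    """NS_eta(f) = 1/2 * (1 - sum_a (1 - 2*eta)^|a| * f_hat(a)^2)."""
    return 0.5 * (1 - sum((1 - 2 * eta) ** sum(a) * c * c
                          for a, c in fourier(f, n).items()))

n, eta = 4, 0.1
f = lambda x: 1 if (x[0] & x[1]) | (x[2] & x[3]) else -1   # a small DNF
print(ns_direct(f, n, eta), ns_from_spectrum(f, n, eta))   # the two values agree
```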
20. Monotone DT
- [OS]: monotone functions are efficiently learnable given that
  - ε is constant
  - s_DT(f) is used as the size measure
- Techniques
  - Harmonic analysis: for monotone f, AS(f) ≤ √(log s_DT(f))
  - [BT]: if S = {a : |a| < AS(f)/ε} then Σ_{a∉S} f̂²(a) < ε
  - Friedgut: there is a T with |T| ≤ 2^(AS(f)/ε) s.t. Σ_{A∉T} f̂²(A) < ε
21. Weak Approximators
- [KKL] also show that if f is monotone, there is an i such that -f̂(i) ≥ log²n / n
- Therefore Pr[f(x) = -χ_i(x)] ≥ ½ + log²n / (2n)
- In general, an h s.t. Pr[f = h] ≥ ½ + 1/poly(n,s) is called a weak approximator to f
- If A outputs a weak approximator for every f in F, then F is weakly learnable
22. Uniform Learning Model
- Recap of the model from slide 3: from uniform random examples <x, f(x)> drawn from EX(f), output h: {0,1}^n → {0,1} with Pr_{x~U}[f(x) ≠ h(x)] < ε
23. Weak Uniform Learning Model
- Boolean function class F (e.g., DNF)
- Target function f: {0,1}^n → {0,1}
- Example oracle EX(f) supplies uniform random examples <x, f(x)>
- Learning algorithm A outputs a hypothesis h: {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] < ½ - 1/p(n,s)
24. Efficient Weak Learning Algorithm for Monotone Boolean Functions
- Draw a set of n² examples <x, f(x)>
- For i = 1 to n
  - Estimate f̂(i)
- Output h = -χ_i for the i maximizing -f̂(i) (a minimal sketch follows below)
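A minimal sketch of this weak learner, under the ±1 range and the sign convention above (for monotone f, the heavy degree-1 coefficient is negative, so the hypothesis is -χ_i); the names, default sample size, and toy target are illustrative.

```python
import itertools
import random

def chi_i(i, x):
    """Single-variable parity chi_i(x) = (-1)^(x_i)."""
    return -1 if x[i] else 1

def weak_learn_monotone(f, n, m=None):
    """Estimate the n degree-1 coefficients from m uniform examples and output
    h = -chi_i for the index i with the largest estimate of -f_hat(i)."""
    m = m or n * n   # n^2 examples, as on the slide
    examples = [tuple(random.getrandbits(1) for _ in range(n)) for _ in range(m)]
    est = [sum(f(x) * chi_i(i, x) for x in examples) / m for i in range(n)]
    best = max(range(n), key=lambda i: -est[i])
    return lambda x: -chi_i(best, x)

# Toy check against a small monotone target (a monotone DNF on 8 variables).
n = 8
f = lambda x: 1 if (x[0] & x[1]) | (x[2] & x[3] & x[4]) else -1
h = weak_learn_monotone(f, n)
agree = sum(h(x) == f(x) for x in itertools.product((0, 1), repeat=n)) / 2 ** n
print(agree)   # noticeably above 1/2
```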
25. Weak Approximation for MAJ of Constant-Depth Circuits
- Note that adding a single MAJ gate to a CDC destroys the LMN spectral property
- [JKS]: MAJ of CDCs is quasi-efficiently quasi-weakly uniform learnable
- If f is a MAJ of CDCs of depth d, and the number of gates in f is s, then there is a set A such that
  - |A| < log^d s ≡ k
  - Pr[f(x) = χ_A(x)] ≥ ½ + 1/(4s·n^k)
26. Weak Learning Algorithm
- Compute k = log^d s
- Draw s·n^k examples <x, f(x)>
- Repeat over sets A with |A| < k:
  - Estimate f̂(A)
- Until an A is found with f̂(A) > 1/(2s·n^k)
- Output h = χ_A
- Run time n^polylog(s) (a compact sketch of the search follows below)
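A compact sketch of this parity search; the threshold, sample size, and the mildly parity-correlated toy target are illustrative, and a real run would use the parameters stated above.

```python
import itertools
import random

def chi(a, x):
    return -1 if sum(ai & xi for ai, xi in zip(a, x)) % 2 else 1

def find_heavy_parity(examples, n, k, threshold):
    """Return the first index set A (as a 0/1 tuple) with |A| <= k whose
    estimated Fourier coefficient exceeds the threshold, or None."""
    m = len(examples)
    for size in range(k + 1):
        for idx in itertools.combinations(range(n), size):
            a = tuple(1 if i in idx else 0 for i in range(n))
            if sum(y * chi(a, x) for x, y in examples) / m > threshold:
                return a
    return None

# Toy target: the parity of x1 and x2, with its label flipped when x4 = x5 = x6 = 1.
n, m = 6, 5000
def target(x):
    parity = 1 if (x[0] ^ x[1]) == 0 else -1
    return -parity if (x[3] & x[4] & x[5]) else parity
examples = []
for _ in range(m):
    x = tuple(random.getrandbits(1) for _ in range(n))
    examples.append((x, target(x)))
print(find_heavy_parity(examples, n, k=2, threshold=0.25))  # -> (1, 1, 0, 0, 0, 0)
```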
27. Weak Approximator: Proof Techniques
- Discriminator Lemma [HMPST]
  - Implies that one of the CDCs is a weak approximator to f
- [LMN] spectral characterization of CDCs
- Harmonic analysis
- A result of Beigel is used to extend weak learning to CDCs with polylog many MAJ gates
28. Boosting
- In many (not all) cases, uniform weak learning algorithms can be converted to uniform (strong) learning algorithms using a boosting technique [S, F, ...]
- Need to learn weakly with respect to near-uniform distributions
  - For a near-uniform distribution D, find a weak hypothesis h_j s.t. Pr_{x~D}[h_j = f] > ½ + 1/poly(n,s)
- The final h is typically a MAJ of weak approximators
29. Strong Learning for MAJ of Constant-Depth Circuits
- [JKS]: MAJ of CDC is quasi-efficiently uniform learnable
  - Show that for near-uniform distributions, some parity function is a weak approximator
  - The Beigel result again extends this to CDC with polylog many MAJ gates
  - [KP] boosting: there are distributions for which no parity is a weak approximator
30. Uniform Learning from a Membership Oracle
- Boolean function class F (e.g., DNF)
- Target function f: {0,1}^n → {0,1}
- Membership oracle MEM(f): on query x, returns f(x)
- Accuracy ε > 0
- Learning algorithm A outputs a hypothesis h: {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] < ε
31. Uniform Membership Learning of Decision Trees
- [KM]:
  - L1(f) ≡ Σ_a |f̂(a)| ≤ s_DT(f)
  - If S = {a : |f̂(a)| ≥ ε/L1(f)} then Σ_{a∉S} f̂²(a) < ε
  - [GL]: an algorithm (membership oracle) for finding all a with |f̂(a)| ≥ θ in time ≈ n/θ^6
  - So DT can be efficiently uniform membership learned
  - Output h has the same form as LMN: h = sign(Σ_{a∈S} f̂(a)·χ_a) (a sketch of a KM-style coefficient search follows below)
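A hedged sketch of a KM/GL-style search for large coefficients using membership queries; the estimator, sample sizes, and toy target are illustrative choices, not the papers' exact algorithm.

```python
import random

def chi(a, x):
    """Parity character chi_a(x) = (-1)^(a . x) for 0/1 tuples a and x."""
    return -1 if sum(ai & xi for ai, xi in zip(a, x)) % 2 else 1

def weight_estimate(f, n, prefix, samples=500, inner=32):
    """Estimate W(prefix) = sum over suffixes b of f_hat(prefix.b)^2, using
    W(prefix) = E_x[ (E_y[ f(y.x) * chi(prefix, y) ])^2 ], where y ranges over
    the first len(prefix) bits and x over the remaining bits."""
    k = len(prefix)
    total = 0.0
    for _ in range(samples):
        x = tuple(random.getrandbits(1) for _ in range(n - k))
        inner_sum = 0.0
        for _ in range(inner):
            y = tuple(random.getrandbits(1) for _ in range(k))
            inner_sum += f(y + x) * chi(prefix, y)   # membership query at y.x
        total += (inner_sum / inner) ** 2
    return total / samples

def km_search(f, n, theta):
    """Return candidate indices a with |f_hat(a)| plausibly >= theta.
    Only prefixes whose estimated weight stays above theta^2 / 2 survive, so
    (by Parseval) only polynomially many prefixes are kept at each level."""
    prefixes = [()]
    for _ in range(n):
        survivors = []
        for p in prefixes:
            for bit in (0, 1):
                q = p + (bit,)
                if weight_estimate(f, n, q) >= theta ** 2 / 2:
                    survivors.append(q)
        prefixes = survivors
    return prefixes

# Tiny usage example: on 4 bits, f = chi_{(1,1,0,0)} has one coefficient of size 1.
n = 4
f = lambda x: -1 if (x[0] ^ x[1]) else 1
print(km_search(f, n, theta=0.5))   # -> [(1, 1, 0, 0)]
```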
32. Uniform Membership Learning of DNF
- [J]:
  - For all distributions D, there is a χ_a s.t. Pr_{x~D}[f(x) = χ_a(x)] ≥ ½ + 1/(6·s_DNF)
  - A modified GL algorithm can efficiently locate such a χ_a given an oracle for a near-uniform D
  - Boosters can provide such an oracle when uniform learning
  - Boosting then provides strong learning
- [BJTb] (see also [KS]):
  - A modified Levin algorithm finds the χ_a in time ≈ n·s²
33. Uniform Learning from a Classification Noise Oracle
- Boolean function class F (e.g., DNF)
- Target function f: {0,1}^n → {0,1}
- Classification noise oracle EX_η(f): draws a uniform random x and returns <x, f(x)> with probability 1-η and the mislabeled <x, ¬f(x)> with probability η
- Accuracy ε > 0, noise rate η > 0
- Learning algorithm A outputs a hypothesis h: {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] < ε
34. Uniform Learning from a Statistical Query Oracle
- Boolean function class F (e.g., DNF)
- Target function f: {0,1}^n → {0,1}
- Statistical query oracle SQ(f): given a query (q(·,·), τ), returns E_U[q(x, f(x))] to within tolerance τ
- Accuracy ε > 0
- Learning algorithm A outputs a hypothesis h: {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] < ε
35. SQ and Classification Noise Learning
- [K]:
  - If F is uniform SQ learnable in time poly(n, s_F, 1/ε, 1/τ) then F is uniform CN learnable in time poly(n, s_F, 1/ε, 1/τ, 1/(1-2η))
- Empirically, it is almost always the case that if F is efficiently uniform learnable then F is efficiently uniform SQ learnable (i.e., 1/τ is poly in the other parameters)
- Exception: F = PAR_n ≡ {χ_a : a ∈ {0,1}^n, |a| ≤ n}
36. Uniform SQ Hardness for PAR
- [BFJKMR]:
  - Harmonic analysis shows that for any q and any χ_a, E_U[q(x, χ_a(x))] = q̂(0^(n+1)) + q̂(a∘1)
  - Thus the adversarial SQ response to (q, τ) is q̂(0^(n+1)) whenever |q̂(a∘1)| < τ
  - Parseval: |q̂(b∘1)| ≥ τ for at most 1/τ² of the Fourier coefficients
  - So a bad query eliminates only polynomially many candidate parities
  - Even PAR_(log n) is not efficiently SQ learnable
37. Uniform Learning from an Attribute Noise Oracle
- Boolean function class F (e.g., DNF)
- Target function f: {0,1}^n → {0,1}
- Attribute noise oracle EX_{D_N}(f): draws a uniform random x and a noise vector r ~ D_N, and returns <x⊕r, f(x)>
- Accuracy ε > 0, noise model D_N
- Learning algorithm A outputs a hypothesis h: {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] < ε
38. Uniform Learning with Independent Attribute Noise
- [BJTa]:
  - Run on noisy examples, the LMN algorithm produces estimates of f̂(a)·E_{r~D_N}[χ_a(r)] (a short derivation follows below)
- Example application
  - Assume the noise process D_N is a product distribution: D_N(x) = Π_i (p_i·x_i + (1-p_i)·(1-x_i))
  - Assume p_i < 1/polylog n and 1/ε at most quasi-poly(n) (mild restrictions)
  - Then a modified LMN algorithm uniform learns attribute-noisy AC0 in quasi-polynomial time
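A short derivation of the attenuation factor under the product-noise assumption above (a sketch of the calculation, not the paper's presentation). It uses χ_a(x⊕r) = χ_a(x)·χ_a(r) and the independence of x and r:

```latex
\[
\mathbb{E}_{x,r}\!\left[f(x)\,\chi_a(x \oplus r)\right]
  = \mathbb{E}_{x,r}\!\left[f(x)\,\chi_a(x)\,\chi_a(r)\right]
  = \hat{f}(a)\,\mathbb{E}_{r \sim D_N}\!\left[\chi_a(r)\right]
  = \hat{f}(a)\prod_{i\,:\,a_i = 1}(1 - 2p_i).
\]
```

So the plain LMN estimate computed from noisy examples <x⊕r, f(x)> converges to f̂(a)·E_{r~D_N}[χ_a(r)]; when the p_i are small, as assumed above, these attenuation factors stay close to 1 for low-degree a, and they can be divided out if the p_i are known or estimated.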
39. Agnostic Learning Model
- Target function f: {0,1}^n → {0,1} is an arbitrary Boolean function
- Example oracle EX(f) supplies uniform random examples <x, f(x)>
- Learning algorithm A outputs a hypothesis h: {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] is minimized
40. Near-Agnostic Learning via LMN
- [KKM]:
  - Let f be an arbitrary Boolean function
  - Fix any set S ⊆ {0,1}^n and fix ε
  - Let g be any function s.t.
    - Σ_{a∉S} ĝ²(a) < ε, and
    - Pr[f ≠ g] is minimized (call this Δ)
  - Then for the h learned by LMN by estimating the coefficients of f over S:
    - Pr[f ≠ h] < 4Δ + ε
41. Average Case Uniform Learning Model
- Boolean function class F (e.g., DNF)
- Target function f: {0,1}^n → {0,1} is D-random, i.e., drawn from a distribution D over F
- Example oracle EX(f) supplies uniform random examples <x, f(x)>
- Accuracy ε > 0
- Learning algorithm A outputs a hypothesis h: {0,1}^n → {0,1} s.t. Pr_{x~U}[f(x) ≠ h(x)] < ε
42. Average Case Learning of DT
- [JSa]:
  - D is uniform over complete, non-redundant, log-depth DTs
  - DT is efficiently uniform learnable on average
  - The output is a DT (proper learning)
43. Average Case Learning of DT
- Technique
  - [KM]: all Fourier coefficients of a DT with minimum depth d are rational with denominator 2^d
  - In an average-case tree, the coefficient f̂(i) of at least one variable v_i has an odd numerator
  - So log(denominator) gives the minimum depth of the tree
  - Try each variable at the root and find the depths of the child trees, choosing the root with the shallowest children
  - Recurse on the child trees to choose their roots
44. Average Case Learning of DNF
- [JSb]:
  - D: s terms, each term drawn uniformly from the terms of length log s
  - Monotone DNF with < n² terms, and DNF with < n^1.5 terms, are properly and efficiently uniform learnable on average
- Harmonic property
  - In an average-case DNF, the sign of f̂(i,j) (usually) indicates whether v_i and v_j appear in a common term
45. Summary
- Most uniform-learning results depend on harmonic analysis
- Learning theory provides motivation for new harmonic observations
- Even very weak harmonic results can be useful in learning-theory algorithms
46. Some Open Problems
- Efficient uniform learning of monotone DNF
  - Best to date for small s_DNF is [S]: time roughly n·s^(log s) (based on [BT], [M], [LMN])
- Non-uniform learning
  - Relatively easy to extend many results to product distributions; e.g., [FJS] extends [LMN]
  - A key issue for real-world applicability
47. Open Problems (cont'd)
- Weaker dependence on ε
  - Several algorithms are fully exponential (or worse) in 1/ε
- Additional proper learning results
  - Proper hypotheses allow for interpretation of the learned hypothesis