Title: Fourier Analysis and Boolean Function Learning
1. Fourier Analysis and Boolean Function Learning
- Jeff Jackson
- Duquesne University
- www.mathcs.duq.edu/jackson
2. Themes
- Fourier analysis is central to learning-theoretic results in a wide variety of models
- Results are generally the strongest known for learning Boolean function classes with respect to the uniform distribution
- Work on learning problems has led to some new harmonic results:
  - Spectral properties of Boolean function classes
  - Algorithms for approximating Boolean functions
3Uniform Learning Model
Boolean Function Class F (e.g., DNF)
Hypothesis h0,1n ? 0,1 s.t. PrxU f(x) ?
h(x) lt e
Target functionf 0,1n ? 0,1
Uniform Random Exampleslt x, f(x) gt
Example OracleEX(f)
Learning AlgorithmA
Accuracy e gt 0
4. Circuit Classes
- Constant-depth AND/OR circuits (AC0 without the polynomial-size restriction; call this CDC)
- DNF: depth-2 circuit with OR at root
[Figure: a depth-d circuit of alternating ∨ and ∧ levels over inputs v1, v2, ..., vn; negations allowed]
5. Decision Trees
[Figure: a decision tree; the root queries v3, internal nodes query v2, v1, and v4, and each leaf is labeled 0 or 1]
6. Decision Trees
[Figure: evaluating x = 11001; x3 = 0, so the evaluation follows the 0-branch out of the root v3]
7. Decision Trees
[Figure: x1 = 1, so the evaluation follows the 1-branch out of v1]
8. Decision Trees
[Figure: x = 11001 reaches a leaf labeled 1, so f(x) = 1]
9. Function Size
- Each function representation has a natural size measure
  - CDC, DNF: # of gates
  - DT: # of leaves
- Size s_F(f) of f with respect to class F is the size of the smallest representation of f within F
- For all Boolean f, s_CDC(f) ≤ s_DNF(f) ≤ s_DT(f)
10. Efficient Uniform Learning Model
- Boolean function class F (e.g., DNF)
- Target function f: {0,1}^n → {0,1}
- Example oracle EX(f): supplies uniform random examples ⟨x, f(x)⟩
- Learning algorithm A, given accuracy ε > 0
- Output: hypothesis h: {0,1}^n → {0,1} s.t. Pr_{x∼U}[f(x) ≠ h(x)] < ε
- Time: poly(n, s_F, 1/ε)
11. Harmonic-Based Uniform Learning
- [LMN]: constant-depth circuits are quasi-efficiently (n^polylog(s/ε)-time) uniformly learnable
- [BT]: monotone Boolean functions are uniformly learnable in time roughly 2^(√n·log n)
  - Monotone: for all x and i, f(x with x_i←0) ≤ f(x with x_i←1)
  - Also exponential in 1/ε (so assumes ε constant)
  - But independent of any size measure
12. Notation
- Assume f: {0,1}^n → {-1,1}
- For all a ∈ {0,1}^n, χ_a(x) ≡ (-1)^(a·x)
- For all a ∈ {0,1}^n, the Fourier coefficient f̂(a) of f at a is f̂(a) = E_{x∼U}[f(x)·χ_a(x)]
- Sometimes write, e.g., f̂(1) for f̂(100)
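To make the notation concrete, here is a minimal Python sketch (function names are illustrative): chi evaluates χ_a, and fourier_estimate approximates f̂(a) = E_{x∼U}[f(x)·χ_a(x)] from uniform random examples.

```python
import random

def chi(a, x):
    # chi_a(x) = (-1)^(a . x) for a, x in {0,1}^n
    return (-1) ** sum(ai & xi for ai, xi in zip(a, x))

def fourier_estimate(f, a, n, m=10000):
    # Empirical estimate of f_hat(a) = E_{x~U}[f(x) * chi_a(x)]
    # from m uniform random examples; f maps {0,1}^n -> {-1,1}.
    total = 0
    for _ in range(m):
        x = [random.randint(0, 1) for _ in range(n)]
        total += f(x) * chi(a, x)
    return total / m

# Sanity check: f = chi_(110) has f_hat(110) = 1, all other coefficients 0.
f = lambda x: chi([1, 1, 0], x)
print(fourier_estimate(f, [1, 1, 0], 3))  # ~ 1.0
print(fourier_estimate(f, [0, 1, 0], 3))  # ~ 0.0
```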
13. Fourier Properties of Classes
- [LMN]: if f is a constant-depth circuit of depth d and S = {a : |a| < log^d(s/ε)} (|a| = # of 1s in a), then Σ_{a∉S} f̂²(a) < ε
- [BT]: if f is a monotone Boolean function and S = {a : |a| < √n/ε}, then Σ_{a∉S} f̂²(a) < ε
14. Spectral Properties
15. Proof Techniques
- [LMN]: Håstad's Switching Lemma + harmonic analysis
- [BT]: based on [KKL]
  - Define AS(f) ≡ n · Pr_{x,i}[f(x with x_i←0) ≠ f(x with x_i←1)] (average sensitivity)
  - If S = {a : |a| < AS(f)/ε}, then Σ_{a∉S} f̂²(a) < ε
  - For monotone f, harmonic analysis + Cauchy-Schwarz shows AS(f) ≤ √n
  - Note: this is tight for MAJ
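A small Monte Carlo estimator for AS(f), as a concrete companion to the definition above (names and sample size illustrative):

```python
import random

def avg_sensitivity_estimate(f, n, m=20000):
    # AS(f) = n * Pr_{x,i}[f(x with x_i=0) != f(x with x_i=1)],
    # estimated over m random (x, i) pairs; f maps {0,1}^n -> {-1,1}.
    flips = 0
    for _ in range(m):
        x = [random.randint(0, 1) for _ in range(n)]
        i = random.randrange(n)
        x0, x1 = x[:], x[:]
        x0[i], x1[i] = 0, 1
        if f(x0) != f(x1):
            flips += 1
    return n * flips / m

# MAJ on 11 bits: AS(MAJ) = Theta(sqrt(n)), the tight case noted above.
maj = lambda x: 1 if sum(x) > len(x) // 2 else -1
print(avg_sensitivity_estimate(maj, 11))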
16. Function Approximation
- For all Boolean f: f = Σ_a f̂(a)·χ_a (the Fourier expansion)
- For S ⊆ {0,1}^n, define f_S ≡ Σ_{a∈S} f̂(a)·χ_a
- [LMN]: Pr_x[f(x) ≠ sign(f_S(x))] ≤ E[(f − f_S)²] = Σ_{a∉S} f̂²(a)
17. The Fourier Learning Algorithm
- Given ε (and perhaps s, d, ...)
- Determine k such that for S = {a : |a| < k}, Σ_{a∉S} f̂²(a) < ε
- Draw a sufficiently large sample of examples ⟨x, f(x)⟩ to closely estimate f̂(a) for all a ∈ S
  - Chernoff bounds: sample size ≈ n^k/ε suffices
- Output h = sign(Σ_{a∈S} f̂(a)·χ_a)
- Run time ≈ n^(2k)/ε (a code sketch follows)
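A sketch of this low-degree algorithm in Python, indexing coefficients by subsets of {0, ..., n-1}; names and sample size are illustrative, and the loop over all |a| ≤ k is what drives the n^(2k)/ε cost:

```python
import random
from itertools import combinations

def low_degree_learn(f, n, k, m=20000):
    # LMN-style low-degree algorithm: estimate f_hat(a) for every index
    # set a with |a| <= k from one batch of uniform examples, then
    # predict with the sign of the truncated Fourier expansion.
    sample = [tuple(random.randint(0, 1) for _ in range(n)) for _ in range(m)]
    labels = [f(x) for x in sample]

    def chi(a, x):  # parity over the coordinates in the index set a
        return (-1) ** sum(x[i] for i in a)

    coeffs = {a: sum(y * chi(a, x) for x, y in zip(sample, labels)) / m
              for d in range(k + 1) for a in combinations(range(n), d)}

    return lambda x: 1 if sum(c * chi(a, x) for a, c in coeffs.items()) >= 0 else -1

# Example: MAJ on 5 bits is well approximated by its low-degree spectrum.
maj = lambda x: 1 if sum(x) > 2 else -1
h = low_degree_learn(maj, 5, 3)
```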
18. Halfspaces
- [KOS]: halfspaces are efficiently uniformly learnable (given ε is constant)
  - Halfspace: ∃ w ∈ R^(n+1) s.t. f(x) = sign(w·(x∘1))
  - If S = {a : |a| < (21/ε)²}, then Σ_{a∉S} f̂²(a) < ε
  - Apply LMN algorithm
- Similar result applies for an arbitrary function applied to a constant number of halfspaces
  - Intersection of halfspaces is a key learning problem
19. Halfspace Techniques
- [O] (cf. [BKS], [BJTa]):
  - Noise sensitivity of f at γ is the probability that corrupting each bit of x independently with probability γ changes f(x)
  - NS_γ(f) = ½(1 − Σ_a (1−2γ)^|a| f̂²(a))
- [KOS]:
  - If S = {a : |a| < 1/γ}, then Σ_{a∉S} f̂²(a) < 3·NS_γ(f)
  - If f is a halfspace, then NS_γ(f) < 9√γ
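A sampling sketch of NS_γ(f); the noisy pair (x, y) is generated directly (names illustrative):

```python
import random

def noise_sensitivity_estimate(f, n, gamma, m=20000):
    # NS_gamma(f) = Pr[f(x) != f(y)], where x is uniform and y flips
    # each bit of x independently with probability gamma.
    disagree = 0
    for _ in range(m):
        x = [random.randint(0, 1) for _ in range(n)]
        y = [xi ^ (random.random() < gamma) for xi in x]
        disagree += f(x) != f(y)
    return disagree / m

# A halfspace: per [KOS], the estimate should fall below 9*sqrt(gamma).
hs = lambda x: 1 if 3 * x[0] + sum(x[1:]) >= len(x) / 2 else -1
print(noise_sensitivity_estimate(hs, 25, 0.01))
```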
20. Monotone DT
- [OS]: monotone functions are efficiently learnable given that
  - ε is constant
  - s_DT(f) is used as the size measure
- Techniques:
  - Harmonic analysis: for monotone f, AS(f) ≤ √(log s_DT(f))
  - [BT]: if S = {a : |a| < AS(f)/ε}, then Σ_{a∉S} f̂²(a) < ε
  - [Friedgut]: ∃ T, |T| ≤ 2^(AS(f)/ε), s.t. Σ_{A⊄T} f̂²(A) < ε
21. Weak Approximators
- [KKL] also show that if f is monotone, there is an i such that −f̂(i) ≥ log²n / n
- Therefore Pr[f(x) = −χ_i(x)] ≥ ½ + log²n / 2n
- In general, an h s.t. Pr[f = h] ≥ ½ + 1/poly(n,s) is called a weak approximator to f
- If A outputs a weak approximator for every f in F, then F is weakly learnable
22. Uniform Learning Model
- Boolean function class F (e.g., DNF)
- Target function f: {0,1}^n → {0,1}
- Example oracle EX(f): supplies uniform random examples ⟨x, f(x)⟩
- Learning algorithm A, given accuracy ε > 0
- Output: hypothesis h: {0,1}^n → {0,1} s.t. Pr_{x∼U}[f(x) ≠ h(x)] < ε
23. Weak Uniform Learning Model
- Boolean function class F (e.g., DNF)
- Target function f: {0,1}^n → {0,1}
- Example oracle EX(f): supplies uniform random examples ⟨x, f(x)⟩
- Learning algorithm A
- Output: hypothesis h: {0,1}^n → {0,1} s.t. Pr_{x∼U}[f(x) ≠ h(x)] < ½ − 1/p(n,s)
24. Efficient Weak Learning Algorithm for Monotone Boolean Functions
- Draw a set of n² examples ⟨x, f(x)⟩
- For i = 1 to n:
  - Estimate f̂(i)
- Output h = −χ_i for the i maximizing the estimate of −f̂(i) (a code sketch follows)
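A direct rendering of this weak learner as a sketch, assuming f is given as a Python function with range {-1,1} (names illustrative):

```python
import random

def weak_learn_monotone(f, n):
    # Estimate the degree-1 coefficients f_hat(i) from ~n^2 uniform
    # examples; for monotone f, some -f_hat(i) >= log^2(n)/n, so the
    # best single-variable hypothesis h = -chi_i weakly approximates f.
    m = n * n
    sample = [tuple(random.randint(0, 1) for _ in range(n)) for _ in range(m)]
    labels = [f(x) for x in sample]
    corr = [sum(y * -((-1) ** x[i]) for x, y in zip(sample, labels)) / m
            for i in range(n)]  # corr[i] estimates -f_hat(i)
    best = max(range(n), key=lambda i: corr[i])
    return lambda x: -((-1) ** x[best])  # h = -chi_best
```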
25. Weak Approximation for MAJ of Constant-Depth Circuits
- Note that adding a single MAJ gate to a CDC destroys the LMN spectral property
- [JKS]: MAJ of CDCs is quasi-efficiently quasi-weakly uniformly learnable
- If f is a MAJ of CDCs of depth d, and the number of gates in f is s, then there is an A ∈ {0,1}^n such that
  - |A| < log^d s ≡ k
  - Pr[f(x) = χ_A(x)] ≥ ½ + 1/(4s·n^k)
26. Weak Learning Algorithm
- Compute k = log^d s
- Draw s·n^k examples ⟨x, f(x)⟩
- Repeat over A with |A| < k:
  - Estimate f̂(A)
- Until an A is found s.t. f̂(A) > 1/(2s·n^k)
- Output h = χ_A
- Run time n^polylog(s) (a code sketch follows)
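A sketch of this search loop, with a generic threshold theta standing in for 1/(2s·n^k) (names illustrative):

```python
import random
from itertools import combinations

def weak_learn_parity(f, n, k, theta, m=20000):
    # Exhaustively estimate f_hat(A) over all index sets |A| < k and
    # return the first parity chi_A whose coefficient clears theta.
    sample = [tuple(random.randint(0, 1) for _ in range(n)) for _ in range(m)]
    labels = [f(x) for x in sample]
    for d in range(k):
        for A in combinations(range(n), d):
            est = sum(y * (-1) ** sum(x[i] for i in A)
                      for x, y in zip(sample, labels)) / m
            if est > theta:
                return lambda x, A=A: (-1) ** sum(x[i] for i in A)
    return None  # no heavy low-degree coefficient found at this sample size
```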
27. Weak Approximator: Proof Techniques
- Discriminator Lemma [HMPST]
  - Implies one of the CDCs is a weak approximator to f
- LMN spectral characterization of CDC
- Harmonic analysis
- Beigel result used to extend weak learning to CDC with polylog MAJ gates
28. Boosting
- In many (not all) cases, uniform weak learning algorithms can be converted to uniform (strong) learning algorithms using a boosting technique ([S], [FS], ...)
- Need to learn weakly with respect to near-uniform distributions
  - For a near-uniform distribution D, find a weak h_j s.t. Pr_{x∼D}[h_j = f] > ½ + 1/poly(n,s)
- Final h is typically a MAJ of weak approximators (see the sketch below)
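A minimal AdaBoost-style sketch in the spirit of [FS]; it reweights a fixed labeled sample (labels in {-1,1}) and outputs a weighted MAJ. The weak_learner(sample, labels, weights) interface is an assumption of this sketch, not from the talk:

```python
import math

def boost(weak_learner, sample, labels, rounds):
    m = len(sample)
    w = [1.0 / m] * m                     # start from the uniform distribution
    hyps = []
    for _ in range(rounds):
        h = weak_learner(sample, labels, w)
        err = sum(wi for wi, x, y in zip(w, sample, labels) if h(x) != y)
        err = min(max(err, 1e-9), 0.5 - 1e-9)
        alpha = 0.5 * math.log((1 - err) / err)
        hyps.append((alpha, h))
        # Reweight: upweight the examples h got wrong, then renormalize.
        w = [wi * math.exp(-alpha * y * h(x))
             for wi, x, y in zip(w, sample, labels)]
        total = sum(w)
        w = [wi / total for wi in w]
    # Final hypothesis: weighted MAJ of the weak approximators.
    return lambda x: 1 if sum(a * h(x) for a, h in hyps) >= 0 else -1
```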
29. Strong Learning for MAJ of Constant-Depth Circuits
- [JKS]: MAJ of CDC is quasi-efficiently uniformly learnable
  - Show that for near-uniform distributions, some parity function is a weak approximator
  - Beigel result again extends this to CDC with polylog MAJ gates
- [KP] boosting: there are distributions for which no parity is a weak approximator
30. Uniform Learning from a Membership Oracle
- Boolean function class F (e.g., DNF)
- Target function f: {0,1}^n → {0,1}
- Membership oracle MEM(f): on query x, returns f(x)
- Learning algorithm A, given accuracy ε > 0
- Output: hypothesis h: {0,1}^n → {0,1} s.t. Pr_{x∼U}[f(x) ≠ h(x)] < ε
31. Uniform Membership Learning of Decision Trees
- [KM]:
  - L₁(f) ≡ Σ_a |f̂(a)| ≤ s_DT(f)
  - If S = {a : |f̂(a)| ≥ ε/L₁(f)}, then Σ_{a∉S} f̂²(a) < ε
- [GL] (Goldreich-Levin): algorithm (membership oracle) for finding all a with |f̂(a)| ≥ θ in time ≈ n/θ⁶
- So DT can be efficiently uniformly membership-learned (see the sketch below)
- Output h has the same form as LMN: h = sign(Σ_{a∈S} f̂(a)·χ_a)
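A compact sketch of the [KM]/[GL]-style search, given a membership oracle mem (a Python function returning f(x) ∈ {-1,1}); the θ²/2 cutoff and sample size are illustrative:

```python
import random

def km_heavy_coefficients(mem, n, theta, m=2000):
    # Kushilevitz-Mansour-style search (a sketch): keep a prefix alpha
    # alive only if the total squared weight of coefficients extending
    # it, W(alpha) = sum_beta f_hat(alpha.beta)^2, appears large.  By
    # Parseval, at most ~1/theta^2 prefixes per level survive the cut.
    def chi(a, y):
        return (-1) ** sum(ai & yi for ai, yi in zip(a, y))

    def weight(alpha):
        # W(alpha) = E_x[(E_y[f(y.x) chi_alpha(y)])^2], estimated with
        # membership queries on y.x and y2.x for independent y, y2.
        k = len(alpha)
        total = 0.0
        for _ in range(m):
            x = [random.randint(0, 1) for _ in range(n - k)]
            y = [random.randint(0, 1) for _ in range(k)]
            y2 = [random.randint(0, 1) for _ in range(k)]
            total += mem(y + x) * chi(alpha, y) * mem(y2 + x) * chi(alpha, y2)
        return total / m

    prefixes = [[]]
    for _ in range(n):
        prefixes = [p + [b] for p in prefixes for b in (0, 1)
                    if weight(p + [b]) >= theta * theta / 2]
    return prefixes  # candidate a's with |f_hat(a)| >= theta (approximately)
```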
32. Uniform Membership Learning of DNF
- [J]:
  - For all distributions D, ∃ χ_a s.t. Pr_{x∼D}[f(x) = χ_a(x)] ≥ ½ + 1/(6·s_DNF)
  - Modified GL can efficiently locate such a χ_a given an oracle for near-uniform D
  - Boosters can provide such an oracle when uniform learning
  - Boosting then provides strong learning
- [BJTb], [KS], [F]:
  - For near-uniform D, can find χ_a in time ≈ n·s²
33. Uniform Learning from a Random Walk Oracle
- Boolean function class F (e.g., DNF)
- Target function f: {0,1}^n → {0,1}
- Random walk oracle RW(f): supplies random walk examples ⟨x, f(x)⟩
- Learning algorithm A, given accuracy ε > 0
- Output: hypothesis h: {0,1}^n → {0,1} s.t. Pr_{x∼U}[f(x) ≠ h(x)] < ε
34. Random Walk DNF Learning
- [BMOS]:
  - Noise sensitivity and related values can be accurately estimated using a random walk oracle (see the sketch below)
  - NS_γ(f) = ½(1 − Σ_a (1−2γ)^|a| f̂²(a))
  - T_ρ^b(f) ≡ Σ_{a: |a| ≤ b} ρ^|a| f̂²(a)
  - Estimating T_ρ^b(f) is efficient if b is logarithmic
  - Only logarithmic b is needed to learn DNF [BF]
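A rough illustration of the random-walk idea, not the [BMOS] algorithm itself: walk points about 2γn update-steps apart behave approximately like γ-noisy pairs, so noise-sensitivity-type quantities can be estimated from the walk (all names and the spacing heuristic are illustrative).

```python
import random

def random_walk_oracle(f, n):
    # Simulated RW(f): each step re-randomizes one uniformly chosen
    # coordinate, then reveals the example <x, f(x)>.
    x = [random.randint(0, 1) for _ in range(n)]
    while True:
        yield tuple(x), f(x)
        x[random.randrange(n)] = random.randint(0, 1)

def ns_from_walk(f, n, gamma, pairs=5000):
    # After k ~ 2*gamma*n update steps, each coordinate has been flipped
    # with probability about gamma, so Pr[label changes between points
    # k steps apart] roughly tracks NS_gamma(f).
    k = max(1, round(2 * gamma * n))
    walk = random_walk_oracle(f, n)
    labels = [y for _, (_, y) in zip(range(pairs * k), walk)][::k]
    return sum(a != b for a, b in zip(labels, labels[1:])) / (len(labels) - 1)
```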
35. Random Walk Parity Learning
- [JW] (unpublished):
  - Effectively, BMOS is limited to finding heavy Fourier coefficients f̂(a) with logarithmic |a|
  - Using a breadth-first variation of KM, can locate any f̂(a) > θ in time O(n^(log 1/θ))
  - A heavy coefficient corresponds to a parity function that weakly approximates f
36. Uniform Learning from a Classification Noise Oracle
- Boolean function class F (e.g., DNF)
- Target function f: {0,1}^n → {0,1}
- Classification noise oracle EX_η(f): draws uniform random x and returns ⟨x, f(x)⟩ with probability 1−η, ⟨x, ¬f(x)⟩ with probability η
- Learning algorithm A, given accuracy ε > 0 and error rate η > 0
- Output: hypothesis h: {0,1}^n → {0,1} s.t. Pr_{x∼U}[f(x) ≠ h(x)] < ε
37. Uniform Learning from a Statistical Query Oracle
- Boolean function class F (e.g., DNF)
- Target function f: {0,1}^n → {0,1}
- Statistical query oracle SQ(f): given a query (q(·,·), τ), returns some value within τ of E_U[q(x, f(x))]
- Learning algorithm A, given accuracy ε > 0
- Output: hypothesis h: {0,1}^n → {0,1} s.t. Pr_{x∼U}[f(x) ≠ h(x)] < ε
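A simulated SQ oracle as a sketch; the random perturbation models the oracle's freedom to answer anywhere within tolerance τ (names illustrative):

```python
import random

def make_sq_oracle(f, n, m=50000):
    # SQ(f): given query q(x, label) and tolerance tau, return some
    # value within tau of E_U[q(x, f(x))] (empirical mean + slack).
    def sq(q, tau):
        total = 0.0
        for _ in range(m):
            x = tuple(random.randint(0, 1) for _ in range(n))
            total += q(x, f(x))
        return total / m + random.uniform(-tau / 2, tau / 2)
    return sq

# Example query: the degree-1 correlation E[f(x) * (-1)^(x_0)].
# sq = make_sq_oracle(f, n)
# est = sq(lambda x, y: y * (-1) ** x[0], 0.05)
```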
38. SQ and Classification Noise Learning
- [K]:
  - If F is uniformly SQ learnable in time poly(n, s_F, 1/ε, 1/τ), then F is uniformly CN learnable in time poly(n, s_F, 1/ε, 1/τ, 1/(1−2η))
- Empirically, it is almost always true that if F is efficiently uniformly learnable, then F is efficiently uniformly SQ learnable (i.e., 1/τ is poly in the other parameters)
- Exception: F = PAR_n ≡ {χ_a : a ∈ {0,1}^n, |a| ≤ n}
39. Uniform SQ Hardness for PAR
- [BFJKMR]:
  - Harmonic analysis shows that for any q and any χ_a: E_U[q(x, χ_a(x))] = q̂(0^(n+1)) + q̂(a∘1)
  - Thus an adversarial SQ response to (q, τ) is q̂(0^(n+1)) whenever |q̂(a∘1)| < τ
  - Parseval: |q̂(b∘1)| < τ for all but 1/τ² Fourier coefficients
  - So each bad query eliminates only polynomially many coefficients
  - Even PAR_(log n) is not efficiently SQ learnable
40. Uniform Learning from an Attribute Noise Oracle
- Boolean function class F (e.g., DNF)
- Target function f: {0,1}^n → {0,1}
- Attribute noise oracle EX_{D_N}(f): draws uniform random x and returns ⟨x⊕r, f(x)⟩ with r ∼ D_N
- Noise model D_N
- Learning algorithm A, given accuracy ε > 0
- Output: hypothesis h: {0,1}^n → {0,1} s.t. Pr_{x∼U}[f(x) ≠ h(x)] < ε
41. Uniform Learning with Independent Attribute Noise
- [BJTa]:
  - The LMN algorithm actually produces estimates of f̂(a)·E_{r∼D_N}[χ_a(r)]
- Example application:
  - Assume the noise process D_N is a product distribution: D_N(x) = Π_i (p_i·x_i + (1−p_i)(1−x_i))
  - Assume p_i < 1/polylog n and 1/ε at most quasi-poly(n) (mild restrictions)
  - Then a modified LMN uniformly learns attribute-noisy AC0 in quasi-poly time (see the sketch below)
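A sketch of the resulting correction when the noise is an independent product with known rates p_i: the naive estimate converges to f̂(a)·Π_{i∈a}(1−2p_i), so dividing by this attenuation factor recovers f̂(a) (assumes the p_i are known and bounded away from ½; names illustrative):

```python
def denoised_fourier_estimate(noisy_sample, a, p):
    # noisy_sample: list of (x XOR r, f(x)) with r_i ~ Bernoulli(p[i])
    # independently; a is an index set (subset of range(n)).
    m = len(noisy_sample)
    naive = sum(y * (-1) ** sum(z[i] for i in a)
                for z, y in noisy_sample) / m   # ~ f_hat(a) * E[chi_a(r)]
    atten = 1.0
    for i in a:
        atten *= 1.0 - 2.0 * p[i]               # E[chi_a(r)] for product noise
    return naive / atten
```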
42. Agnostic Learning Model
- Arbitrary Boolean target function f: {0,1}^n → {0,1}
- Example oracle EX(f): supplies uniform random examples ⟨x, f(x)⟩
- Learning algorithm A, given accuracy ε > 0
- Output: hypothesis h ∈ H s.t. Pr_{x∼U}[f(x) ≠ h(x)] < opt_H + ε (opt_H = best error achievable by any hypothesis in H)
43. Agnostic Learning of Halfspaces
- [KKMS]:
  - Agnostic learning algorithm for H = the set of halfspaces
  - Algorithm is not Fourier-based (L₁ regression)
  - However, a somewhat weaker result can be obtained by simple Fourier analysis
44. Near-Agnostic Learning via LMN
- [KKMS]:
  - Let f be an arbitrary Boolean function
  - Fix any set S ⊆ {0,1}^n and fix ε
  - Let g be any function s.t.
    - Σ_{a∉S} ĝ²(a) < ε, and
    - Pr[f ≠ g] (call this Δ) is minimized over all such g
  - Then for the h learned by LMN by estimating the coefficients of f over S: Pr[f ≠ h] < 4Δ + ε
45. Summary
- Most uniform-learning results for Boolean function classes depend on harmonic analysis
- Learning theory provides motivation for new harmonic observations
- Even very weak harmonic results can be useful in learning-theory algorithms
46. Some Open Problems
- Efficient uniform learning of monotone DNF
  - Best to date for small s_DNF is [Ser], time ≈ n·s^(log s) (based on [BT], [M], [LMN])
- Non-uniform learning
  - Relatively easy to extend many results to product distributions; e.g., [FJS] extends LMN
  - Key issue in real-world applicability
47. Open Problems (cont'd)
- Weaker dependence on ε
  - Several algorithms are fully exponential (or worse) in 1/ε
- Additional proper learning results
  - Proper learning allows for interpretation of the learned hypothesis
48. References
- [Beigel] Beigel. When Do Extra Majority Gates Help?...
- [BFJKMR] Blum, Furst, Jackson, Kearns, Mansour, Rudich. Weakly Learning DNF...
- [BJTa] Bshouty, Jackson, Tamon. Uniform-Distribution Attribute Noise Learnability.
- [BJTb] Bshouty, Jackson, Tamon. More Efficient PAC-learning of DNF...
- [BKS] Benjamini, Kalai, Schramm. Noise Sensitivity of Boolean Functions...
- [BMOS] Bshouty, Mossel, O'Donnell, Servedio. Learning DNF from Random Walks.
- [BT] Bshouty, Tamon. On the Fourier Spectrum of Monotone Functions.
- [F] Feldman. Attribute Efficient and Non-adaptive Learning of Parities...
- [FJS] Furst, Jackson, Smith. Improved Learning of AC0 Functions.
- [FS] Freund, Schapire. A Decision-theoretic Generalization of On-line Learning...
- [Friedgut] Friedgut. Boolean Functions with Low Average Sensitivity Depend on Few Coordinates.
- [HMPST] Hajnal, Maass, Pudlák, Szegedy, Turán. Threshold Circuits of Bounded Depth.
- [J] Jackson. An Efficient Membership-Query Algorithm for Learning DNF...
- [JKS] Jackson, Klivans, Servedio. Learnability Beyond AC0.
- [JW] Jackson, Wimmer. In preparation.
- [KKL] Kahn, Kalai, Linial. The Influence of Variables on Boolean Functions.
- [KKMS] Kalai, Klivans, Mansour, Servedio. Agnostically Learning Halfspaces.
- [K] Kearns. Efficient Noise-tolerant Learning from Statistical Queries.
- [KM] Kushilevitz, Mansour. Learning Decision Trees Using the Fourier Spectrum.