Title: Noise Tolerant Learning
Noise Tolerant Learning
- Presented by Aviad Maizels

Based on:
- "Noise-Tolerant Learning, the Parity Problem, and the Statistical Query Model" by Avrim Blum, Adam Kalai and Hal Wasserman
- "A Generalized Birthday Problem" by David Wagner
- "Hard-Core Predicates for Any One-Way Function" by Oded Goldreich and Leonid A. Levin
- "Simulated Annealing and Boltzmann Machines" by Emile Aarts and Jan Korst
void Agenda()
- do
  - A few sentences about codes
  - The opposite problem
  - Learning with noise
  - The k-sum problem
  - Can we do it faster?
  - Annealing
- while (!understandable)
void fast_introduction_to_LECC()
- The communication channel may corrupt the original data.
- Proposed solution: encode messages to give some protection against errors.
void fast_introduction_to_LECC() (Continued: terminology)

[Diagram: Source -> Encoder -> Channel, with noise entering the channel; the message u1 u2 ... uk is encoded into the codeword x1 x2 ... xn]

- Linear codes:
  - Fixed-size block code
  - Additive closure: the sum of two codewords is a codeword
  - A code is tagged using two parameters (n, k):
    - k: data size
    - n: encoded word size
void fast_introduction_to_LECC() (Continued: terminology)
- Systematic code: the original data appears directly inside the codeword.
- Generator matrix (G): a matrix s.t. multiplying a message by it outputs the encoded word.
  - Number of rows = space dimension (k)
  - Every codeword can be represented as a linear combination of G's rows.
void fast_introduction_to_LECC() (Continued: terminology)
- Hamming distance: the number of places in which two vectors differ
  - Denoted by dist(x,y)
- Hamming weight: the number of places in a vector that differ from zero
  - Denoted by wt(x)
- Minimum distance of a linear code: the minimum weight of any non-zero codeword (see the sketch below)
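A minimal sketch of these definitions in Python, using the standard (7,4) Hamming-code generator matrix as a toy example; the matrix choice and function names are mine, not from the slides:

```python
import numpy as np

# Toy systematic generator matrix G = [I | P] of the (7,4) Hamming code
# (just an illustration; any full-rank k x n binary matrix would do).
G = np.array([[1, 0, 0, 0, 1, 1, 0],
              [0, 1, 0, 0, 1, 0, 1],
              [0, 0, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]], dtype=np.uint8)

def encode(msg):
    """Codeword = msg * G over GF(2): a linear combination of G's rows."""
    return (msg @ G) % 2

def wt(x):
    """Hamming weight: number of places that differ from zero."""
    return int(np.count_nonzero(x))

def dist(x, y):
    """Hamming distance: number of places where x and y differ."""
    return wt((x + y) % 2)

msg = np.array([1, 0, 1, 1], dtype=np.uint8)
c = encode(msg)
print(c[:4])   # systematic: the message appears directly inside the codeword
print(wt(c), dist(c, encode(np.zeros(4, dtype=np.uint8))))  # wt(c) == dist(c, 0)
```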
void fast_introduction_to_LECC() (Continued: terminology)

[Diagram: Channel -> Decoder -> Target; the received word is x + e, for error vector e1 e2 ... en, and the decoder must recover the original msg]

- Perfect code (t): every vector of length n is within Hamming distance t of exactly one codeword.
void fast_introduction_to_LECC() (Continued: terminology)
- ...
- Complete decoding: the acceptance regions around the codewords together contain all the vectors of length n.
void the_opposite_problem()
- Decoding linear (n,k) codes in the presence of random noise, when k = ω(log n), in poly(n) time.
  - k = O(log n) is trivial.
- In !(coding-theory) terms:
  - Given a finite set of codewords (examples) of length n and their labels, plus a fresh codeword x, find/learn the label of x, in the presence of random noise, in poly(n) time.
void the_opposite_problem() (Continued: main idea)
- Without noise:
  - Any vector can be written as a linear combination of previously seen examples.
  - Deducing the vector's label can be done in the same way.
  - So all we need is to find a basis, and we can deduce the label of any new example (see the sketch below).
  - Q: Is it the same in the presence of noise?
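A sketch of the noiseless case: recovering the hidden parity vector is just solving a linear system over GF(2) by Gaussian elimination. All names here (solve_gf2, v_true) are illustrative, not from the source:

```python
import numpy as np

def solve_gf2(A, b):
    """Gaussian elimination over GF(2): find v with A @ v = b (mod 2)."""
    A, b = A.copy() % 2, b.copy() % 2
    m, n = A.shape
    row, pivots = 0, []
    for col in range(n):
        piv = next((r for r in range(row, m) if A[r, col]), None)
        if piv is None:
            continue                      # no pivot in this column
        A[[row, piv]], b[[row, piv]] = A[[piv, row]], b[[piv, row]]
        for r in range(m):
            if r != row and A[r, col]:    # eliminate this column elsewhere
                A[r] ^= A[row]
                b[r] ^= b[row]
        pivots.append(col)
        row += 1
    v = np.zeros(n, dtype=np.uint8)
    for i, col in enumerate(pivots):
        v[col] = b[i]
    return v

rng = np.random.default_rng(0)
n = 8
v_true = rng.integers(0, 2, n, dtype=np.uint8)      # hidden parity vector
X = rng.integers(0, 2, (3 * n, n), dtype=np.uint8)  # noiseless examples
y = ((X @ v_true) % 2).astype(np.uint8)             # their labels
print(np.array_equal(solve_gf2(X, y), v_true))      # True (w.h.p. X has rank n)
```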
void the_opposite_problem() (Continued: main idea)
- Well... no.
- Summing examples actually boosts the noise:
  - Given examples with noise rate η, the sum of s examples has a noise rate of ½ - ½(1-2η)^s.
- Idea: write basis vectors as a sum of a small number of examples, and the new sample as a linear combination of the above.
void learning_with_noise()
- Concept: a boolean function over the input space.
- Concept class: a set of concepts.
- World model:
  - There is a fixed noise rate η.
  - There is a fixed probability distribution D over the input space.
  - There is an unknown concept c.
  - The algorithm may ask for a labeled example (x, l).
void learning_with_noise()
- Goal: find an ε-approximation of c
  - i.e., a function h s.t. Pr_{x~D}[h(x) = c(x)] ≥ 1 - ε
- Parity function: defined by a corresponding vector v ∈ {0,1}^n. The function is then given by the rule c_v(x) = <v, x> mod 2 (see the sketch below).
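A small sketch of this world model, assuming D is the uniform distribution over {0,1}^n; the helper names (parity, noisy_example) are mine:

```python
import numpy as np

rng = np.random.default_rng(1)

def parity(v, x):
    """The parity concept c_v(x) = <v, x> mod 2."""
    return int(v @ x) % 2

def noisy_example(v, n, eta):
    """One labeled example (x, l): x drawn from D (taken here to be the
    uniform distribution over {0,1}^n), label flipped with probability eta."""
    x = rng.integers(0, 2, n, dtype=np.uint8)
    l = parity(v, x) ^ int(rng.random() < eta)
    return x, l
```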
void learning_with_noise() (Continued: preliminaries)
- Efficiently learnable: concept class C is E.L. in the presence of random classification noise under distribution D if
  - ∃ an algorithm A s.t. ∀ ε > 0, δ > 0, 0 ≤ η < ½ and ∀ concept c ∈ C:
    - A produces an ε-approximation of c with probability at least 1 - δ when given access to D-random examples.
    - A runs in time polynomial in n, 1/ε, 1/δ and 1/(½ - η).
void learning_with_noise() (Continued: goal)
- We'll show that the length-k parity problem, for noise rate η, can be learned with time and total size of examples 2^{O(k/log k)}.
- First, observe the behavior of the noise when we're adding up examples.
void learning_with_noise() (Continued: noise behavior)
- Adding up two examples, e.g. 1010111 ⊕ 1111011:
  - p1: the probability that a bit of the first example is noisy; q1: the probability that it is correct.
  - p2, q2: the same for the second example.
- p_i + q_i = 1
- Denote the bias s_i = q_i - p_i = 1 - 2p_i = 2q_i - 1, so s_i ∈ [-1, 1]
- ⇒ p3 = p1·q2 + p2·q1 and q3 = p1·p2 + q1·q2
- ⇒ s3 = q3 - p3 = s1·s2
- ⇒ biases multiply: summing s examples of noise rate η leaves bias (1-2η)^s, i.e. noise rate ½ - ½(1-2η)^s (see the sketch below).
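A quick simulation checking the bias-multiplication rule above; the parameter values (η = 0.1, s = 4) are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
eta, s, trials = 0.1, 4, 200_000

noise = rng.random((trials, s)) < eta       # s independent noise bits per trial
combined = noise.sum(axis=1) % 2            # noise bit of the s-fold sum

print(combined.mean())                      # empirical noise rate of the sum
print(0.5 - 0.5 * (1 - 2 * eta) ** s)       # predicted 1/2 - 1/2*(1-2*eta)^s
```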
void learning_with_noise() (Continued: idea)
- Main idea: draw many more examples than needed, so that basis vectors can be found as a sum of a relatively small number of examples.
- If the resulting labels remain polynomially distinguishable from random (noise rate bounded away from ½), we can repeat the process to boost reliability.
void learning_with_noise() (Continued: definitions)
- A few more definitions:
  - k = ab
  - V_i: the subspace of {0,1}^{ab} consisting of vectors whose last i blocks (of b bits each) are zeroed
  - i-sample: a set of independent vectors that are uniformly distributed over V_i
void learning_with_noise() (Continued: main construction)
- Construction: given an i-sample of size s, we construct an (i+1)-sample of size at least s - 2^b in time O(s).
- Behold (see the sketch below):
  - i-sample: x1, ..., xs.
  - Partition the x's based on their (a-i)-th block (we'll get at most 2^b partitions).
  - For each non-empty partition, pick a random vector, add it to the other vectors in its partition, and then discard the picked vector.
  - Result: vectors z1, ..., zm, with m ≥ s - 2^b, where:
    - one more block is zeroed out, and
    - the z_j are independent and uniformly distributed over V_{i+1}.
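A sketch of one construction step, carrying the labels along; picking the first group member as the representative (rather than a random one, as on the slide) is a simplification, and construction_step is an illustrative name:

```python
def construction_step(xs, labels, block):
    """One construction step: partition the (example, label) pairs on the
    given block of bit positions, XOR every partition member against one
    representative, and discard the representative. The chosen block is
    zeroed in every surviving example."""
    parts = {}
    for x, l in zip(xs, labels):
        parts.setdefault(bytes(x[block]), []).append((x, l))
    out_xs, out_labels = [], []
    for group in parts.values():        # at most 2^b non-empty partitions
        rep_x, rep_l = group[0]         # slide: a random member; first works here
        for x, l in group[1:]:
            out_xs.append(x ^ rep_x)    # sums are uniform over V_{i+1}
            out_labels.append(l ^ rep_l)
    return out_xs, out_labels
```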
void learning_with_noise() (Continued: algorithm)
- Algorithm (finding the 1st bit; see the sketch below):
  - Ask for a·2^b labeled examples.
  - Apply the construction (a-1) times to get an (a-1)-sample.
  - There is a ≈ 1 - 1/e chance that the vector (1,0,...,0) is a member of the (a-1)-sample. If it's not there, we'll do it again with new labeled examples (the expected number of repetitions is constant).
  - Note: we've written (1,0,...,0) as a sum of at most 2^{a-1} examples, boosting its noise rate to ½ - ½(1-2η)^{2^{a-1}}.
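Putting the pieces together, a sketch of the first-bit algorithm, reusing noisy_example and construction_step from the sketches above; it is only practical for tiny a and b:

```python
def find_first_bit(oracle, a, b, votes_wanted=101):
    """Sketch: reduce fresh batches of a*2^b examples to an (a-1)-sample,
    collect the labels of occurrences of e1 = (1,0,...,0), and majority-vote
    over those (noisy) labels."""
    k = a * b
    votes = []
    while len(votes) < votes_wanted:
        batch = [oracle() for _ in range(a * 2 ** b)]
        xs, labels = [x for x, _ in batch], [l for _, l in batch]
        for i in range(a - 1):                       # zero the last a-1 blocks
            block = slice(k - (i + 1) * b, k - i * b)
            xs, labels = construction_step(xs, labels, block)
        for x, l in zip(xs, labels):
            if x[0] == 1 and not x[1:].any():        # found e1 in the sample
                votes.append(l)
    return int(sum(votes) > len(votes) / 2)

# e.g., for a hidden v of length 8 with a=2, b=4, eta=0.05:
#   bit0 = find_first_bit(lambda: noisy_example(v, 8, 0.05), 2, 4)
```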
void learning_with_noise() (Continued: observations)
- Observations:
  - We found the first bit of our new sample using a number of examples, and computation time, polynomial in 2^b.
  - We can shift all examples to determine the remaining bits in the same way.
  - Fixing a = ½·log k and b = 2k/log k gives the desired 2^{O(k/log k)} bound, for a constant noise rate η.
void the_k_sum_problem()
- The key to improving the above algorithm is to find a better way to solve a problem similar to k-sum.
- Problem: given k lists L1, ..., Lk of elements drawn uniformly and independently from {0,1}^n, find x1 ∈ L1, ..., xk ∈ Lk s.t. x1 ⊕ ... ⊕ xk = 0.
- Note: a solution to the k-sum problem exists with good probability if |L1|·|L2|···|Lk| ≥ 2^n (similar to the birthday paradox).
void the_k_sum_problem() (Continued: Wagner's algorithm, definitions)
- Preliminary definitions and observations:
  - low_l(x): the l least significant bits of x
  - L1 ⋈_l L2 contains all pairs from L1 × L2 that agree on the l least significant bits.
  - If low_l(x1 ⊕ x2) = 0 and low_l(x3 ⊕ x4) = 0, then low_l(x1 ⊕ x2 ⊕ x3 ⊕ x4) = 0 and Pr[x1 ⊕ x2 ⊕ x3 ⊕ x4 = 0] = 2^l / 2^n.
- Implementing the join (⋈_l) operation:
  - Hash join: stores one list and scans through the other. O(|L1| + |L2|) steps, O(min(|L1|, |L2|)) storage.
  - Merge join: sorts, then scans, the two sorted lists. O(max(|L1|, |L2|)·log(max(|L1|, |L2|))) time.
void the_k_sum_problem() (Continued: Wagner's algorithm, the simple case)
- The 4-list case (see the sketch below):
  - Extend the lists until each contains 2^l elements.
  - Generate a new list L12 of values x1 ⊕ x2 s.t. low_l(x1 ⊕ x2) = 0, and a new list L34 in the same way.
  - Search for matches between L12 and L34.
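A sketch of the 4-list case with a hash join, representing elements as n-bit integers; four_sum and join are illustrative names, and retrying on failure is left to the caller:

```python
from collections import defaultdict
import random

def join(L1, L2, l):
    """L1 join_l L2 via a hash join: all XORs x1 ^ x2 whose low l bits are 0,
    keeping the pair that produced each value."""
    buckets = defaultdict(list)
    mask = (1 << l) - 1
    for x1 in L1:
        buckets[x1 & mask].append(x1)
    return [(x1 ^ x2, (x1, x2)) for x2 in L2 for x1 in buckets[x2 & mask]]

def four_sum(n):
    """Find x1..x4, one per random list, with x1 ^ x2 ^ x3 ^ x4 == 0."""
    l = n // 3
    lists = [[random.getrandbits(n) for _ in range(2 ** l)] for _ in range(4)]
    L12 = join(lists[0], lists[1], l)
    L34 = join(lists[2], lists[3], l)
    table = {v: pair for v, pair in L34}
    for v, (x1, x2) in L12:
        if v in table:                  # full n-bit match => XOR of all four is 0
            x3, x4 = table[v]
            return x1, x2, x3, x4
    return None  # no luck; success probability per attempt is constant

print(four_sum(24))
```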
void the_k_sum_problem() (Continued: Wagner's algorithm)
- Observations:
  - Pr[low_l(xi ⊕ xj) = 0] = 1/2^l when 1 ≤ i < j ≤ 4 and xi, xj are chosen uniformly at random.
  - E[|Lij|] = |Li|·|Lj| / 2^l = 2^{2l} / 2^l = 2^l
  - The expected number of elements common to L12 and L34, i.e. the desired solutions, is |L12|·|L34| / 2^{n-l} (so l ≥ n/3 gives us at least 1).
- Complexity:
  - O(2^{n/3}) time and space.
void the_k_sum_problem() (Continued: Wagner's algorithm)
- Improvements:
  - We don't need the low l bits to be zero; we can fix them to any value α.
  - The value 0 in x1 ⊕ ... ⊕ xk = 0 can be replaced with a constant c of our choice (by replacing Lk with Lk ⊕ c).
  - If k' < k, the complexity of the k-sum problem can be no larger than the complexity of the k'-sum problem (just pick arbitrary x_{k'+1}, ..., x_k, define c = x_{k'+1} ⊕ ... ⊕ x_k, and use the k'-sum algorithm to find a solution of x1 ⊕ ... ⊕ x_{k'} = c).
  - ⇒ we can solve the k-sum problem with complexity at most O(2^{n/3}) for all k ≥ 4.
void the_k_sum_problem() (Continued: Wagner's algorithm)
- Extending the 4-list case (see the sketch below):
  - Create a complete binary tree of depth log k, with the lists at the leaves.
  - At each depth, join pairs of lists on the next l low bits.
  - So we'll get an algorithm that requires O(k · 2^{n/(1+log k)}) time and space, using lists of size 2^{n/(1+log k)}.
  - Note: if k is not a power of 2, we'll take k' to be the largest power of 2 less than k, afterwards using the list-elimination trick above.
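A values-only sketch of the k-tree; it reuses join from the 4-list sketch and omits the bookkeeping needed to recover the individual x_i:

```python
def k_tree(lists, l):
    """Wagner's k-tree, values only: pair up the lists, keep XORs whose low
    l bits agree (so the XOR's low l bits are 0), shift those bits away, and
    repeat; the final pair must match on all remaining bits."""
    while len(lists) > 2:
        lists = [[v >> l for v, _ in join(L1, L2, l)]
                 for L1, L2 in zip(lists[::2], lists[1::2])]
    table = set(lists[1])
    return [v for v in lists[0] if v in table]   # root collisions: total XOR is 0

# With k lists of 2^l elements each and l = n / (1 + log2(k)), roughly one
# solution is expected, matching the O(k * 2^(n/(1+log k))) bound above.
```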
void can_we_do_it_better_?()
- But... maybe there's a problem with the approach? Yes:
  - How many samples do we really need to get a solution with good probability? About k·log k - log(-ln(1-ε))
  - Do we even need a basis? Yes and no.
  - Can we do it without scanning the whole space? Yes.
  - Do we need the best solution? No.
void can_we_do_it_better_?() (Continued: sampling space)
- To have a solution we need k linearly independent vectors in our sampling space S. So:
  - We'll want ... where ε ∈ (0,1)
  - ⇒ |sampling space| = O(k·log k·f(ε))
void annealing()
- The physical process of heating up a solid until it melts, followed by cooling it down into a state of perfect lattice.
- Problem: finding, among a potentially very large number of solutions, a solution with minimal cost.
- Note: we don't even need the minimal-cost solution, just one whose noise rate is below our threshold.
void annealing() (Continued: combinatorial optimization)
- Some definitions:
  - The set of solutions to the combinatorial problem is taken as the set of states S.
    - Note: in our case, the states are built from our sample set (see the reduction below).
  - The price function is the energy E: S → R that we minimize.
  - The transition probability between neighboring states depends on their energy difference and an external temperature T.
void annealing() (Continued: pseudo-code algorithm)
- Set T to a high temperature
- Choose an arbitrary initial state c
- Loop (see the runnable sketch below):
  - Select a neighbor c' of c; set ΔE = E(c') - E(c)
  - If ΔE ≤ 0, accept c'; otherwise accept it with probability exp(-ΔE/T)
  - Do the two steps above several more times
  - Decrease T
- Wait long enough and cross fingers (preferably more than 2)
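A runnable version of the pseudocode, with a geometric cooling schedule standing in for the unspecified "decrease T" step; all parameter values are arbitrary:

```python
import math
import random

def anneal(initial, neighbor, energy, T=10.0, alpha=0.95, moves=100, phases=200):
    """Generic simulated-annealing loop, following the pseudocode above."""
    c = initial
    for _ in range(phases):            # one phase per temperature value
        for _ in range(moves):         # several moves at each temperature
            c2 = neighbor(c)
            dE = energy(c2) - energy(c)
            if dE <= 0 or random.random() < math.exp(-dE / T):
                c = c2                 # downhill always, uphill with prob e^(-dE/T)
        T *= alpha                     # "decrease T" (geometric cooling, an assumption)
    return c

# Toy usage: minimize (x - 3)^2 over the integers.
best = anneal(0, lambda x: x + random.choice((-1, 1)), lambda x: (x - 3) ** 2)
```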
void annealing() (Continued: problems)
- Problems:
  - Not all states can yield our new sample (only the ones containing at least one vector from S \ basis).
  - The probability that a capable state will yield the zero vector is 1/2^k.
  - The probability that any 1 ≤ j ≤ k vectors from S will yield a solution is ...
    - Note: when |S| → k, the expression above approaches zero.
void annealing() (Continued: reduction)
- Idea:
  - Sample a little more than is needed: |S| = O(ck)
  - Assign each vector its Hamming weight and sort S by it.
- Reduction:
  - Spawning the next generation: all the states which include a vector whose Hamming weight is at most 2·wt(x_l).
void annealing() (Continued: convergence complexity ??)
- Complexity: O(L·τ·ln|S|), where
  - L denotes the number of steps to reach quasi-equilibrium in each phase,
  - τ denotes the computation time of a single transition, and
  - ln(|S|) denotes the number of phases needed to reach an accepted solution, using a polynomial-time cooling schedule.
Game Over
"I don't even see the code anymore; all I can see now are blondes, brunettes, redheads." - Cypher (The Matrix)
void appendix() (GL)
- Theorem: suppose we have oracle access to a random process b_x: {0,1}^n → {0,1}, so that
  - Pr_r[b_x(r) = b(x,r)] ≥ ½ + ε,
  - where the probability is taken uniformly over the internal coin tosses of b_x and all possible choices of r, and b(x,r) denotes the inner product mod 2 of x and r.
- Then, we can in time polynomial in n/ε output a list of strings that contains x with probability at least ½.
void appendix() (Continued: GL, the highway)
- How??
- 1st way (to extract x_i):
  - Suppose s(x) = Pr_r[b_x(r) = b(x,r)] ≥ 3/4 + ε (hmmm??)
  - The probability that both b_x(r) = b(x,r) and b_x(r ⊕ e_i) = b(x, r ⊕ e_i) hold is at least ½ + 2ε, and in that case x_i = b(x, e_i) = b_x(r) ⊕ b_x(r ⊕ e_i).
  - But... we are only promised s(x) ≥ ½ + ε.
void appendix() (Continued: GL, a better way)
- 2nd way:
  - Idea: guess b(x,r) by ourselves.
  - Problem: we need to guess polynomially many r's.
  - Solution: generate polynomially many r's so that they are sufficiently random, but we can still guess all of them with non-negligible probability.
void appendix() (Continued: GL, a better way)
- Construction (see the sketch below):
  - Select l strings uniformly in {0,1}^n and denote them by s1, ..., sl.
  - Guess the values b(x,s1), ..., b(x,sl); call the guesses σ1, ..., σl.
  - The probability that all guesses are correct is 2^{-l}.
  - Assign to each non-empty subset J of {1,...,l} the string r_J = ⊕_{j∈J} s_j.
  - Note that b(x, r_J) = ⊕_{j∈J} b(x, s_j), and that the r_J are pairwise independent and uniformly distributed.
  - Try all possibilities for σ1, ..., σl and output a list of 2^l candidates for x ∈ {0,1}^n.
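A sketch of this construction: build the pairwise-independent r_J family from l seed strings, with bit strings represented as integers (rj_family is an illustrative name):

```python
import random
from itertools import chain, combinations

def rj_family(l, n):
    """Pick s1..sl uniformly from {0,1}^n, then set r_J = XOR of
    {s_j : j in J} for every non-empty subset J of {1..l}."""
    s = [random.getrandbits(n) for _ in range(l)]
    family = {}
    for J in chain.from_iterable(combinations(range(l), t) for t in range(1, l + 1)):
        r = 0
        for j in J:
            r ^= s[j]
        family[J] = r      # 2^l - 1 strings, pairwise independent and uniform
    return s, family

# If sigma_j is a (guessed) value of b(x, s_j), then the guess for b(x, r_J)
# is the XOR of {sigma_j : j in J}; trying all 2^l guesses of sigma_1..sigma_l
# yields the 2^l-candidate list from the slide.
```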