Transcript and Presenter's Notes

Title: Noise Tolerant Learning


1
Noise Tolerant Learning
  • Presented by Aviad Maizels

Based on:
  • "Noise-tolerant learning, the parity problem, and the statistical query model" \ Avrim Blum, Adam Kalai and Hal Wasserman
  • "A Generalized Birthday Problem" \ David Wagner
  • "Hard-core predicates for any one-way function" \ Oded Goldreich and Leonid A. Levin
  • "Simulated Annealing and Boltzmann Machines" \ Emile Aarts and Jan Korst
2
void Agenda()
  • do
  • A few sentences about Codes
  • The opposite problem
  • Learning with noise
  • The k-sum problem
  • Can we do it faster ??
  • Annealing
  • while (!understandable)


3
void fast_introduction_to_LECC()
  • The communication channel may disrupt the original data.

Proposed solution: encode messages to give some protection against errors.
4
void fast_introduction_to_LECC()(Continue
terminology)
[Diagram: Source → Encoder → Channel, with noise added on the channel; the msg u1 u2 ... uk is encoded into the codeword x1 x2 ... xn]
  • Linear codes
  • Fixed-size block code
  • Additive closure (the sum of two codewords is a codeword)
  • The code is tagged with two parameters (n, k)
  • k: data size
  • n: encoded word size
5
void fast_introduction_to_LECC()(Continue
terminology)
  • Systematic code: the original data appears directly inside the codeword.
  • Generating matrix (G): a matrix s.t. multiplying it with a message outputs the encoded word.
  • Number of rows = space dimension (k).
  • Every codeword can be represented as a linear combination of G's rows.

6
void fast_introduction_to_LECC()(Continue
terminology)
  • Hamming distance: the number of positions in which two vectors differ.
  • Denoted by dist(x,y)
  • Hamming weight: the number of non-zero positions in a vector.
  • Denoted by wt(x)
  • Minimum distance of a linear code: the minimum weight of any non-zero codeword.

7
void fast_introduction_to_LECC()(Continue
terminology)
[Diagram: Channel → Decoder → Target; the received word is x ⊕ e, where e = e1 e2 ... en is the error vector, and the decoder must recover the msg]
  • Perfect code (t): every vector of length n is within Hamming distance t of exactly one codeword.

8
void fast_introduction_to_LECC()(Continue
terminology)
...
  • Complete decoding: the acceptance regions around the codewords together contain all the vectors of length n.


9
void the_opposite_problem()
  • Decoding linear (n,k) codes in the presence of random noise, when k = O(log n · log log n), in poly(n) time.
  • k = O(log n) is trivial
  • in !(coding-theory) terms:
  • Given a finite set of codewords (examples) of length n, their labels, and a new codeword, find/learn its label, in the presence of random noise, in poly(n) time.

10
void the_opposite_problem()(Continue Main idea)
  • Without noise:
  • Any vector can be written as a linear combination of previously seen examples.
  • Deducing the vector's label can be done in the same way.
  • So all we need is to find a basis in order to deduce the label of any new example.
  • Q: Is it the same in the presence of noise?

11
void the_opposite_problem()(Continue Main idea)
  • Well, no.
  • Summing examples actually boosts the noise:
  • Given examples with noise rate η, the sum of s examples has a noise rate of
  • ½ - ½(1 - 2η)^s
  • ⇒ Write basis vectors as a sum of a small number of examples, and the new sample as a linear combination of the above (a quick numeric check follows below).
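A quick numeric check of this formula (a minimal Python sketch; the function name is an illustrative choice, not from the slides):

    def summed_noise_rate(eta, s):
        # Noise rate of the XOR of s independent labels, each flipped with probability eta
        return 0.5 - 0.5 * (1 - 2 * eta) ** s

    # For eta = 0.1: s = 1 -> 0.10, s = 2 -> 0.18, s = 8 -> ~0.416 (quickly approaches 1/2)
    for s in (1, 2, 8):
        print(s, summed_noise_rate(0.1, s))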


12
void learning_with_noise()
  • Concept: a boolean function over the input space
  • Concept class: a set of concepts
  • World model:
  • There is a fixed noise rate η
  • A fixed probability distribution D over the input space
  • The algorithm may ask for a labeled example (x, l)
  • An unknown concept c

13
void learning_with_noise()
  • Goal: find an ε-approximation of c
  • A function h s.t. Pr_{x~D}[h(x) = c(x)] ≥ 1 - ε
  • Parity function: defined by a corresponding vector v ∈ {0,1}^n. The function is then given by the rule f_v(x) = v·x mod 2 = v1·x1 ⊕ ... ⊕ vn·xn (see the sketch below).
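As a concrete illustration (a minimal Python sketch; the names parity and noisy_example are my own, not from the slides):

    import random

    def parity(v, x):
        # The parity concept f_v(x) = <v, x> mod 2
        return sum(vi & xi for vi, xi in zip(v, x)) % 2

    def noisy_example(v, n, eta):
        # Draw x uniformly from {0,1}^n; return (x, label), flipping the label with probability eta
        x = [random.randint(0, 1) for _ in range(n)]
        label = parity(v, x)
        if random.random() < eta:
            label ^= 1
        return x, label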

14
void learning_with_noise()(Continue
Preliminaries)
  • Efficiently learnable: concept class C is E.L. in the presence of random classification noise under distribution D if:
  • ∃ an algorithm A s.t. ∀ ε > 0, δ > 0, 0 ≤ η < ½ and ∀ concept c ∈ C:
  • A produces an ε-approximation of c with probability at least 1 - δ when given access to D-random examples.
  • A must run in time polynomial in n, 1/ε, 1/δ and in 1/(½ - η).

15
void learning_with_noise()(Continue Goal)
  • We'll show that the length-k parity problem, for a constant noise rate η < ½, can be solved with computation time and total number of examples of 2^O(k/log k).
  • Observe the behavior of the noise when we're adding up examples:

16
void learning_with_noise()(Continue Noise
behavior)
p1 = frequency with which the first label is noisy; q1 = frequency with which it is correct (e.g. example 1010111).
p2 = frequency with which the second label is noisy; q2 = frequency with which it is correct (e.g. example 1111011).
  • p_i + q_i = 1
  • Denote s_i = p_i - q_i = 2p_i - 1 = 1 - 2q_i, so s_i ∈ [-1, 1]
  • The XOR of the two labels is noisy iff exactly one of them is noisy:
  • ⇒ p3 = p1·q2 + p2·q1, q3 = p1·p2 + q1·q2
  • ⇒ s3 = p3 - q3 = -s1·s2
  • ⇒ the bias shrinks multiplicatively, so the XOR of s examples with noise rate η has noise rate ½ - ½(1 - 2η)^s

17
void learning_with_noise()(Continue Idea)
  • Main idea: draw many more examples than needed, in order to write the basis vectors as sums of relatively small numbers of examples.
  • If the resulting labels are not polynomially indistinguishable from random (i.e. their noise rate stays noticeably below ½),
  • we can repeat the process to boost reliability.

18
void learning_with_noise()(Continue
Definitions)
  • A few more definitions:
  • k = a·b
  • V_i: the subspace of {0,1}^(a·b) consisting of vectors whose last i blocks (of b bits each) are zeroed
  • i-sample: a set of independent vectors that are uniformly distributed over V_i

19
void learning_with_noise()(Continue Main
construction)
  • Construction: given an i-sample of size s, we construct an (i+1)-sample of size at least s - 2^b in time O(s).
  • Behold (a minimal code sketch follows below):
  • i-sample: x1, ..., xs.
  • Partition the x's based on their next not-yet-zeroed block of b bits (we get at most 2^b partitions).
  • For each non-empty partition, pick a random vector, add it to the other vectors in its partition, and then discard it.
  • Result: vectors z1, ..., zm with m ≥ s - 2^b, where:
  • one more block is zeroed out, and
  • the z_j are independent and uniformly distributed over V_(i+1).
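A minimal Python sketch of this construction step (function and variable names are my own; I assume examples are (vector, label) pairs with vectors stored as bit lists):

    import random

    def next_sample(sample, i, a, b):
        # One reduction step: turn an i-sample into an (i+1)-sample.
        # Each element of sample is (x, label), where x is a list of a*b bits
        # whose last i blocks (of b bits each) are already zero.
        block_start = (a - i - 1) * b          # 0-based start of the next block to clear
        buckets = {}
        for x, label in sample:
            key = tuple(x[block_start:block_start + b])
            buckets.setdefault(key, []).append((x, label))

        result = []
        for group in buckets.values():
            # Pick a random representative, XOR it into the rest, then discard it.
            rep_x, rep_label = group.pop(random.randrange(len(group)))
            for x, label in group:
                z = [xi ^ ri for xi, ri in zip(x, rep_x)]
                result.append((z, label ^ rep_label))
        return result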

20
void learning_with_noise()(Continue Algorithm)
  • Algorithm (finding the 1st bit; see the driver sketch below):
  • Ask for a·2^b labeled examples.
  • Apply the construction (a-1) times to get an (a-1)-sample.
  • There is a 1 - 1/e chance that the vector (1,0,...,0) will be a member of the (a-1)-sample. If it's not there, we'll do it again with new labeled examples (the expected number of repetitions is constant).
  • Note: we've written (1,0,...,0) as a sum of 2^(a-1) examples, causing the noise rate to boost to ½ - ½(1 - 2η)^(2^(a-1)).
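Continuing the sketch above (again with illustrative names), the first-bit procedure could look like this; it returns None when (1,0,...,0) did not show up, in which case the caller retries with fresh examples and repeats to boost reliability:

    def first_bit(examples, a, b):
        # Reduce (a-1) times, then look for the unit vector e1 = (1,0,...,0)
        # and return its (noisy) label, or None if it is absent.
        n = a * b
        sample = examples
        for i in range(a - 1):
            sample = next_sample(sample, i, a, b)
        e1 = [1] + [0] * (n - 1)
        for x, label in sample:
            if x == e1:
                return label
        return None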

21
void learning_with_noise()(Continue
Observations)
  • Observations:
  • We found the first bit of our new sample using a number of examples and a computation time polynomial in 2^b and in (1/(1 - 2η))^(2^a).
  • We can shift all examples to determine the remaining bits.
  • Fixing a = (1/2)·log k and b = 2k/log k will give the desired 2^O(k/log k) bound
  • for a constant noise rate η.


22
void the_k_sum_problem()
  • The key to improving the above algorithm is to find a better way to solve a problem similar to the k-sum problem.
  • Problem: given k lists L1, ..., Lk of elements drawn uniformly and independently from {0,1}^n, find x1 ∈ L1, ..., xk ∈ Lk s.t. x1 ⊕ x2 ⊕ ... ⊕ xk = 0.
  • Note: a solution to the k-sum problem exists with good probability if |L1|·|L2|···|Lk| ≥ 2^n (similar to the birthday paradox).

23
void the_k_sum_problem()(Continue Wagners
Algorithm - Definitions)
  • Preliminary definitions and observations:
  • low_l(x): the l least significant bits of x
  • L1 ⋈_l L2: contains all pairs from L1 × L2 that agree on the l least significant bits
  • If low_l(x1 ⊕ x2) = 0 and low_l(x3 ⊕ x4) = 0, then low_l(x1 ⊕ x2 ⊕ x3 ⊕ x4) = 0 and Pr[x1 ⊕ x2 ⊕ x3 ⊕ x4 = 0] = 2^l/2^n
  • The join (⋈_l) operation:
  • Hash join: stores one list in a hash table and scans through the other
  • O(|L1| + |L2|) steps, O(|L1| + |L2|) storage
  • Merge join: sorts, then scans the two sorted lists
  • O(max(|L1|, |L2|)·log(max(|L1|, |L2|))) time

24
void the_k_sum_problem()(Continue Wagners
Algorithm Simple case)
  • The 4-list case (a code sketch follows below):
  • Extend the lists until each contains 2^l elements.
  • Generate a new list L12 of values x1 ⊕ x2 s.t. low_l(x1 ⊕ x2) = 0, and a new list L34 in the same way.
  • Search for matches between L12 and L34.
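A minimal Python sketch of this case (elements are modeled as n-bit integers; the helper names and the choice of a hash join are assumptions for illustration):

    from collections import defaultdict

    def low(x, l):
        # The l least significant bits of x
        return x & ((1 << l) - 1)

    def join(L1, L2, l):
        # All XORs x1 ^ x2 with low_l(x1 ^ x2) == 0, via a hash join on the low bits
        table = defaultdict(list)
        for x1 in L1:
            table[low(x1, l)].append(x1)
        return [x1 ^ x2 for x2 in L2 for x1 in table[low(x2, l)]]

    def four_list_case(L1, L2, L3, L4, n):
        # Values common to L12 and L34 correspond to quadruples with x1^x2^x3^x4 == 0
        l = n // 3
        L12 = join(L1, L2, l)
        L34 = join(L3, L4, l)
        return set(L12) & set(L34)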

25
void the_k_sum_problem()(Continue Wagners
Algorithm)
  • Observation:
  • Pr[low_l(x_i ⊕ x_j) = 0] = 1/2^l when 1 ≤ i < j ≤ 4 and x_i, x_j are chosen uniformly at random
  • E[|L_ij|] = (|L_i|·|L_j|)/2^l = 2^(2l)/2^l = 2^l
  • The expected number of elements common to L12 and L34 that yield the desired solutions is |L12|·|L34|/2^(n-l) (l ≥ n/3 will give us at least 1)
  • Complexity:
  • O(2^(n/3)) time and space

26
void the_k_sum_problem()(Continue Wagners
Algorithm)
  • Improvements:
  • We don't need the low l bits to be zero; we can fix them to any value a (i.e. require low_l(x1 ⊕ x2) = low_l(x3 ⊕ x4) = a).
  • The value 0 in x1 ⊕ ... ⊕ xk = 0 can be replaced with a constant c of our choice (by replacing Lk with L'k = Lk ⊕ c).
  • If k < k', the complexity of the k'-sum problem can be no larger than the complexity of the k-sum problem (just pick arbitrary x_(k+1), ..., x_(k'), define c = x_(k+1) ⊕ ... ⊕ x_(k') and use the k-sum algorithm to find a solution to x1 ⊕ ... ⊕ xk = c) ⇒
  • we can solve the k-sum problem with complexity at most O(2^(n/3)) for all k ≥ 4.

27
void the_k_sum_problem()(Continue Wagners
Algorithm)
  • Extending the 4-list case (a recursive sketch follows below):
  • Create a complete binary tree of depth log k, with the lists at the leaves and joins at the internal nodes.
  • At each depth we use the join to zero out the next n/(1 + log k) low bits,
  • so we'll get an algorithm that requires O(k · 2^(n/(1 + log k))) time and space.
  • Note: if k is not a power of 2, we'll take k' to be the largest power of 2 less than k, using afterwards the list-elimination trick above.
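Extending the same sketch (reusing the join helper from the 4-list sketch above; k must be a power of 2 here, and the parameter choice follows the bound quoted above):

    import math

    def k_tree(lists, n):
        # At level h, join pairs of lists on the low h*l bits; the final two
        # lists are matched on all n bits.
        k = len(lists)
        l = n // (1 + int(math.log2(k)))
        level = 1
        while len(lists) > 2:
            lists = [join(lists[i], lists[i + 1], level * l)
                     for i in range(0, len(lists), 2)]
            level += 1
        return set(lists[0]) & set(lists[1])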


28
void can_we_do_it_better_?()
  • But maybe there's a problem with the approach?
  • How many samples do we really need to get a solution with good probability?
  • Do we even need a basis?
  • Can we do it without scanning the whole space?
  • Do we need the best solution?
  • Yes
  • Yes
  • k·log k - log(-ln(1 - ε))
  • Yes / no
  • Yes
  • No

29
void can_we_do_it_better_?()(Continue Sampling
space)
  • To have a solution we need k linearly independent vectors in our sampling space S. So:
  • We'll want this to happen with probability at least 1 - ε, where ε ∈ (0,1)
  • ⇒ |sampling space| = O(k·log k·f(ε))


30
void annealing()
  • The physical process of heating up a solid until it melts, followed by cooling it down into a state of perfect lattice.
  • Problem: finding, among a potentially very large number of solutions, a solution with minimal cost.
  • Note: we don't even need the minimal-cost solution - just one whose noise rate is below our threshold.

31
void annealing()(Continue Combinatorial
optimization)
  • Some definitions:
  • The set of solutions to the combinatorial problem is taken as the set of states S
  • Note: in our case
  • The cost function is the energy E: S → R that we minimize
  • The transition probability between neighboring states depends on their energy difference and on an external temperature T

32
void annealing()(Continue Pseudo code
algorithm)
  • Set T to a high temperature
  • Choose an arbitrary initial state c
  • Loop:
  • Select a neighbor c' of c; set ΔE = E(c') - E(c)
  • If ΔE ≤ 0 accept c'; otherwise accept it with probability exp(-ΔE/T) (see the runnable sketch below)
  • Do the 2 steps above several more times
  • Decrease T
  • Wait long enough and cross fingers (preferably more than 2)
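A generic Python sketch of this loop (the cooling schedule and parameter names are illustrative assumptions, not from the slides):

    import math
    import random

    def simulated_annealing(initial, energy, neighbor,
                            t_start=10.0, t_factor=0.95, steps_per_t=100, t_min=1e-3):
        # Generic Metropolis-style annealing loop
        state, temp = initial, t_start
        best = state
        while temp > t_min:
            for _ in range(steps_per_t):
                cand = neighbor(state)
                delta = energy(cand) - energy(state)
                # Accept improvements, and worse moves with probability exp(-delta/T)
                if delta <= 0 or random.random() < math.exp(-delta / temp):
                    state = cand
                    if energy(state) < energy(best):
                        best = state
            temp *= t_factor
        return best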

33
void annealing()(Continue Problems)
  • Problems
  • Not all states can yield our new sample (only the
    ones containing at least one vector from
    S\basis).
  • The probability that a capable state will yield
    the zero vector is 1/2k
  • The probability that any 1?j?k vectors from S
    will yield a solution is
  • Note When S?k the phrase above approaches zero

34
void annealing()(Continue Reduction)
  • Idea:
  • Sample a little more than is needed: |S| = O(c·k)
  • Assign each vector its Hamming weight and sort S by it.
  • Reduction:
  • Spawning the next generation: all the states which include a vector whose Hamming weight is ≥ 2wt(?l)

35
void annealing()(Continue Convergence
Complexity ??)
  • Complexity: O(L·τ·ln|S|),
  • where L denotes the number of steps needed to reach quasi-equilibrium in each phase and τ denotes the computation time of a transition,
  • and ln|S| denotes the number of phases needed to reach an accepted solution, using a polynomial-time cooling schedule.

36
Game Over
"I don't even see the code anymore. All I can see now are blondes, brunettes, redheads..." - Cipher (The Matrix)
37
void appendix()(GL)
  • Theorem: suppose we have oracle access to a random process b_x: {0,1}^n → {0,1}, so that
  • Pr_r[b_x(r) = b(x,r)] ≥ ½ + ε,
  • where the probability is taken uniformly over the internal coin tosses of b_x and all possible choices of r, and b(x,r) denotes the inner product mod 2 of x and r.
  • Then we can, in time polynomial in n/ε, output a list of strings that contains x with probability at least ½.

38
void appendix()(Continue GL highway)
  • How??
  • 1st way (to extract x_i):
  • Suppose s(x) = Pr_r[b_x(r) = b(x,r)] ≥ 3/4 + ε (hmmm??)
  • The probability that both b_x(r) = b(x,r) and b_x(r ⊕ e_i) = b(x,r ⊕ e_i) hold is at least ½ + 2ε, and b(x,r) ⊕ b(x,r ⊕ e_i) = b(x,e_i) = x_i,
  • but we are only guaranteed s(x) ≥ ½ + ε.

39
void appendix()(Continue GL better way)
  • 2nd way
  • Idea Guess b(x,r) by ourselves.
  • Problem Need to guess polynomially many rs.
  • Solution Generate polynomially many rs so that
    they are sufficiently random but still we can
    guess them with non-negligible probability.

40
void appendix()(Continue GL better way)
  • Construction (a small sketch follows below):
  • Select l strings uniformly in {0,1}^n and denote them by s^1, ..., s^l.
  • Guess σ^i = b(x, s^i) for i = 1, ..., l.
  • The probability that all guesses are correct is 2^(-l).
  • Assign each r_J to a different non-empty subset J ⊆ {1,...,l}, setting r_J = ⊕_{j∈J} s^j.
  • Note that b(x, r_J) = ⊕_{j∈J} b(x, s^j) = ⊕_{j∈J} σ^j when the guesses are correct.
  • Try all possibilities for σ^1, ..., σ^l and output a list of 2^l candidates z_i ∈ {0,1}^n.
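A small Python sketch of this correlated-r construction (the helper name is mine; seeds are modeled as n-bit integers):

    import random
    from itertools import combinations

    def correlated_rs(l, n):
        # From l uniform seeds s^1..s^l, build r_J = XOR of {s^j : j in J} for
        # every non-empty subset J of {1,..,l}; guessing the l values b(x, s^j)
        # then determines b(x, r_J) for all 2^l - 1 of the r_J's at once.
        seeds = [random.getrandbits(n) for _ in range(l)]
        rs = {}
        for size in range(1, l + 1):
            for subset in combinations(range(l), size):
                r = 0
                for j in subset:
                    r ^= seeds[j]
                rs[subset] = r
        return seeds, rs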