Title: Nonparametric Techniques
1. Nonparametric Techniques
- Shyh-Kang Jeng
- Department of Electrical Engineering/
- Graduate Institute of Communication/
- Graduate Institute of Networking and Multimedia,
National Taiwan University
2. Problems of Parameter Estimation Approaches
- Common parametric forms rarely fit the densities actually encountered in practice
- All of the classical parametric densities are unimodal
- Many practical problems involve multimodal densities
- A high-dimensional density can rarely be represented as a product of one-dimensional functions
3. Nonparametric Methods
- Can be used with arbitrary distributions
- Need no assumptions about the forms of the underlying densities
- Basic types
  - Estimating the density function p(x | ωj) from samples
  - Directly estimating the a posteriori probabilities P(ωj | x)
  - Bypassing probability estimation and going directly to decision functions
4. Density Estimation: Naïve Approach
5. Density Estimation: Naïve Approach
(Figure: relative probability Pk/Pk,max)
6. Problems of the Naïve Approach
- If the volume is fixed and we take more samples, we get only a space-averaged value of p(x)
- If the volume approaches zero with n fixed, the estimated p(x) will be close to zero or infinity, and hence useless
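
The estimate these two bullets refer to is the standard relative-frequency one: if k of the n samples fall inside a region of volume V around x, then

$$p(\mathbf{x}) \;\approx\; \frac{k/n}{V},$$

so fixing V averages p over the region, while shrinking V with n fixed starves the cell of samples.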
7. Better Approaches
8. Hypercube Parzen Windows
9. Parzen Windows for Interpolation
10. Examples of Parzen Windows
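
To make the Parzen-window slides concrete, here is a minimal 1-D sketch with a Gaussian window; the function name, the choice hn = h1/sqrt(n), and the toy data are illustrative assumptions, not the lecture's code.

```python
import numpy as np

def parzen_estimate(x, samples, h1):
    """Parzen-window density estimate at points x from 1-D samples,
    using a Gaussian window with width h_n = h1 / sqrt(n)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    hn = h1 / np.sqrt(n)                             # shrink the window as n grows
    u = (x[:, None] - samples[None, :]) / hn         # scaled distances to each sample
    phi = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)   # Gaussian window function
    return phi.sum(axis=1) / (n * hn)                # average of the window contributions

# Hypothetical usage: estimate a standard normal density from 1000 samples.
rng = np.random.default_rng(0)
samples = rng.standard_normal(1000)
print(parzen_estimate(np.linspace(-4, 4, 9), samples, h1=1.0))
```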
11. Convergence Considerations
12. Convergence of the Mean
13. Convergence of the Variance
14. Convergence of the Variance
15. Illustration 1: 1-D Gaussian
16. Illustration 1: 1-D Gaussian
17. Illustration 2: 2-D Gaussian
18. Illustration 3: Uniform and Triangular
19. Classification Examples
20. Pros and Cons of Nonparametric Methods
- Generality
- The number of samples needed may be very large
  - Much larger than required if we knew the form of the density
- Curse of dimensionality
  - High-dimensional functions are much more complicated and harder to discern
- Better to incorporate prior knowledge about the data, provided that knowledge is correct
21. Probabilistic Neural Networks (PNN)
22. PNN Training
23. Activation Function
24. PNN Classification
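
As a rough sketch of how the PNN described on slides 21-24 operates, the code below normalizes each training pattern into a pattern unit, computes Gaussian activations exp((net - 1)/σ²), and lets category units sum them; all names, the value of sigma, and the toy data are illustrative assumptions, not the lecture's code.

```python
import numpy as np

def normalize(patterns):
    """Normalize each pattern to unit length, as PNN training requires."""
    patterns = np.asarray(patterns, dtype=float)
    return patterns / np.linalg.norm(patterns, axis=1, keepdims=True)

def pnn_classify(x, train_x, train_y, sigma=0.5):
    """Classify x with a probabilistic neural network:
    each training pattern is a pattern unit with activation
    exp((w.x - 1) / sigma^2); category units sum the activations."""
    w = normalize(train_x)                        # pattern-unit weights
    x = np.asarray(x, dtype=float)
    x = x / np.linalg.norm(x)                     # the test pattern is normalized too
    net = w @ x                                   # inner products (net activations)
    act = np.exp((net - 1.0) / sigma**2)          # Gaussian window activations
    classes = np.unique(train_y)
    scores = [act[train_y == c].sum() for c in classes]   # category-unit sums
    return classes[int(np.argmax(scores))]

# Hypothetical usage with a tiny two-class data set.
train_x = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]])
train_y = np.array([0, 0, 1, 1])
print(pnn_classify([0.15, 0.9], train_x, train_y))   # expected: class 1
```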
25. kn-Nearest-Neighbor Estimation
- Let the cell volume be a function of the test data
- Prototypes
  - Training samples
- Estimate p(x) (see the sketch below)
  - Center a cell about x
  - Let the cell grow until it captures kn samples
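
A minimal 1-D sketch of this cell-growing estimate, taking pn(x) = (kn/n)/Vn with Vn the length of the interval that just captures the kn nearest samples; the choice kn = sqrt(n) and the toy data are illustrative assumptions.

```python
import numpy as np

def knn_density(x, samples, kn=None):
    """k_n-nearest-neighbor density estimate in 1-D:
    grow an interval around each x until it captures k_n samples,
    then p_n(x) = (k_n / n) / V_n, with V_n the interval length."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    if kn is None:
        kn = max(1, int(np.sqrt(n)))              # one common choice: k_n = sqrt(n)
    x = np.atleast_1d(np.asarray(x, dtype=float))
    dists = np.abs(x[:, None] - samples[None, :])
    r = np.sort(dists, axis=1)[:, kn - 1]         # radius to the k_n-th nearest sample
    vol = 2.0 * r                                 # 1-D "volume" of the grown cell
    return (kn / n) / vol

# Hypothetical usage on samples from a standard normal density.
rng = np.random.default_rng(1)
print(knn_density([0.0, 1.0, 2.0], rng.standard_normal(500)))
```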
26. kn-Nearest-Neighbor Estimation
27. kn-Nearest-Neighbor Estimation
28. kn-Nearest-Neighbor Estimation
- Necessary and sufficient conditions for pn(x) to converge to p(x)
- Example
29. kn-Nearest-Neighbor Estimation
30. Estimation of A Posteriori Probabilities
- Place a cell of volume V (Parzen or kn-nearest-neighbor) around x
  - Capture k samples
  - ki of them are labeled ωi
- Estimate of the joint probability p(x, ωi)
- Estimate of P(ωi | x)
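
Written out in standard notation, the two estimates referred to above are

$$p_n(\mathbf{x}, \omega_i) = \frac{k_i/n}{V}, \qquad P_n(\omega_i \mid \mathbf{x}) = \frac{p_n(\mathbf{x}, \omega_i)}{\sum_{j=1}^{c} p_n(\mathbf{x}, \omega_j)} = \frac{k_i}{k},$$

i.e., the posterior estimate is simply the fraction of the captured samples that carry the label ωi.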
31. Nearest-Neighbor Rule
- Let Dn = {x1, . . ., xn} denote a set of n labeled prototypes
- Let x' in Dn be the prototype nearest to a test point x
- Assign x the label associated with x'
- Suboptimal
  - Leads to an error rate greater than the Bayes rate
  - But never worse than twice the Bayes rate
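
A minimal brute-force sketch of the rule as stated above; the function name and toy prototypes are illustrative assumptions.

```python
import numpy as np

def nearest_neighbor_classify(x, prototypes, labels):
    """Nearest-neighbor rule: assign x the label of the closest prototype."""
    prototypes = np.asarray(prototypes, dtype=float)
    dists = np.linalg.norm(prototypes - np.asarray(x, dtype=float), axis=1)
    return labels[int(np.argmin(dists))]

# Hypothetical usage with three labeled prototypes.
protos = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]]
labels = ["w1", "w2", "w2"]
print(nearest_neighbor_classify([0.9, 0.8], protos, labels))   # expected: "w2"
```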
32. Heuristic Understanding
- The label θ' associated with the nearest neighbor is a random variable
- P(θ' = ωi | x') = P(ωi | x')
- When the number of samples is very large, we assume that x' is close enough to x that P(ωi | x') approximately equals P(ωi | x)
33. Voronoi Tessellation
34. Probability of Error
35. Convergence of Nearest Neighbor
36. Error Rate for Nearest-Neighbor Rule
37. Error Rate for Nearest-Neighbor Rule
38. Approximate Error Bound
39. A More Rigorous Approach
40. A More Rigorous Approach
41. A More Rigorous Approach
42. A More Rigorous Approach
- The upper bound is achieved in the zero-information case
  - The densities p(x | ωi) are identical
  - P(ωi | x) = P(ωi)
  - P(e | x) is independent of x
  - P* lies between 0 and (c - 1)/c
43. Bounds of Nearest-Neighbor Error Rate
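
The bounds summarized on slide 43 are the standard ones relating the large-sample nearest-neighbor error rate P to the Bayes rate P* for c classes:

$$P^{*} \;\le\; P \;\le\; P^{*}\!\left(2 - \frac{c}{c-1}\,P^{*}\right).$$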
44. Convergence Speed
- Convergence can be arbitrarily slow
- Pn(e) need not even decrease monotonically with n
45. k-Nearest-Neighbor Rule
46. Simplified Analysis Results
- Two-class case with k odd
- The labels on each of the k nearest neighbors are random variables
  - They independently assume the values ωi with probabilities P(ωi | x)
- Select ωm if a majority of the k nearest neighbors are labeled ωm, which occurs with the probability given below
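
The probability mentioned in the last bullet is the usual binomial majority expression (reconstructed here in standard notation rather than copied from the slide): with k odd,

$$P(\text{select } \omega_m \mid \mathbf{x}) \;=\; \sum_{i=(k+1)/2}^{k} \binom{k}{i}\, P(\omega_m \mid \mathbf{x})^{\,i}\,\bigl[1 - P(\omega_m \mid \mathbf{x})\bigr]^{\,k-i}.$$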
47. Simplified Analysis Results
- The larger the value of k, the greater the probability that ωm will be selected
- The large-sample two-class error rate is bounded above by Ck(P*), defined to be the smallest concave function of P* greater than a particular binomial sum
48. Upper Bounds of Error Rate for Two-Class Cases
49. More Comments on the k-Nearest-Neighbor Rule
- Can be viewed as an attempt to estimate P(ωi | x) from samples
  - Needs a larger k to obtain a reliable estimate
- Want all k nearest neighbors x' to be very near x, to ensure that P(ωi | x') is approximately the same as P(ωi | x)
  - So k should be a small fraction of n
- Only when n goes to infinity can we be assured of nearly optimal behavior
50. Computational Complexity of the Nearest-Neighbor Rule
- Inspect each stored point in turn
  - O(n)
- Calculate its Euclidean distance to x
  - Each calculation is O(d)
- Retain the identity of only the current closest one
- Total complexity O(dn)
51. A Parallel Nearest-Neighbor Circuit
52. Reducing the Computational Burden in Nearest-Neighbor Search
- Computing partial distances (see the sketch below)
- Prestructuring
  - Create some form of search tree, e.g., a quad-tree with representative points
  - Prototypes are selectively linked
  - Not guaranteed to find the closest prototype
- Editing the stored prototypes
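
To illustrate the partial-distance idea in the first bullet, a minimal sketch that accumulates the squared distance one dimension at a time and abandons a prototype as soon as the partial sum exceeds the best full distance found so far; the function name and toy data are illustrative assumptions.

```python
import numpy as np

def nn_partial_distance(x, prototypes):
    """Nearest-neighbor search with partial distances:
    stop accumulating a prototype's squared distance as soon as it
    already exceeds the best (smallest) squared distance seen so far."""
    x = np.asarray(x, dtype=float)
    best_idx, best_d2 = -1, np.inf
    for idx, p in enumerate(np.asarray(prototypes, dtype=float)):
        d2 = 0.0
        for xi, pi in zip(x, p):              # accumulate one dimension at a time
            d2 += (xi - pi) ** 2
            if d2 >= best_d2:                 # partial sum already too large: abandon
                break
        else:                                 # loop finished: full distance computed
            best_idx, best_d2 = idx, d2
    return best_idx, np.sqrt(best_d2)

# Hypothetical usage.
protos = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [0.2, 0.1, 0.0]]
print(nn_partial_distance([0.1, 0.1, 0.1], protos))   # expected index: 2
```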
53. Nearest-Neighbor Editing
- begin initialize j ← 0, D ← data set, n ← number of prototypes
- construct the full Voronoi diagram of D
- do j ← j + 1; for each prototype x'j
- find the Voronoi neighbors of x'j
- if any neighbor is not from the same class as x'j, then mark x'j
- until j = n
- discard all points that are not marked
- construct the Voronoi diagram of the remaining (marked) prototypes
- end
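
A minimal sketch of the editing algorithm above, using the fact that two points are Voronoi neighbors exactly when they are joined by an edge of the Delaunay triangulation; scipy is assumed to be available, and the function name and toy data are my own illustration rather than the lecture's code.

```python
import numpy as np
from scipy.spatial import Delaunay

def voronoi_edit(points, labels):
    """Nearest-neighbor editing: keep only prototypes that have at least
    one Voronoi (Delaunay) neighbor from a different class."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    tri = Delaunay(points)
    neighbors = [set() for _ in range(len(points))]
    for simplex in tri.simplices:             # vertices of a simplex are mutual neighbors
        for i in simplex:
            neighbors[i].update(simplex)
    marked = []
    for j, nbrs in enumerate(neighbors):
        if any(labels[k] != labels[j] for k in nbrs):   # a neighbor from another class
            marked.append(j)                  # keep: it helps define the decision boundary
    return points[marked], labels[marked]

# Hypothetical usage with a small 2-D two-class set.
rng = np.random.default_rng(2)
pts = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
labs = np.array([0] * 20 + [1] * 20)
kept_pts, kept_labs = voronoi_edit(pts, labs)
print(len(kept_pts), "of", len(pts), "prototypes kept")
```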
54. Nearest-Neighbor Editing
- Complexity
- Does not guarantee the minimum set of points
- Reduces complexity without affecting accuracy
- Generally cannot add training data later
- Can be combined with prestructuring and partial distances
55. Properties of Metrics
56. Effect of Scaling in Euclidean Distance
57. Minkowski Metric (Lk Norm)
58. Tanimoto Metric
- Finds most use in taxonomy
  - When two patterns or features are either the same or different
- A distance between two sets
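
To make slides 57-58 concrete, a small sketch of both metrics: the Minkowski Lk distance (Σ|ai - bi|^k)^(1/k) and the Tanimoto distance (n1 + n2 - 2·n12)/(n1 + n2 - n12), where n1 and n2 are the set sizes and n12 the size of their intersection; the function names are illustrative assumptions.

```python
def minkowski_distance(a, b, k=2):
    """Minkowski (L_k) distance: (sum_i |a_i - b_i|**k) ** (1/k)."""
    return sum(abs(x - y) ** k for x, y in zip(a, b)) ** (1.0 / k)

def tanimoto_distance(s1, s2):
    """Tanimoto distance between two sets:
    (n1 + n2 - 2*n12) / (n1 + n2 - n12), with n12 = |s1 & s2|."""
    s1, s2 = set(s1), set(s2)
    n12 = len(s1 & s2)
    return (len(s1) + len(s2) - 2 * n12) / (len(s1) + len(s2) - n12)

# Hypothetical usage.
print(minkowski_distance([0, 0], [3, 4], k=2))               # Euclidean: 5.0
print(minkowski_distance([0, 0], [3, 4], k=1))               # city block: 7.0
print(tanimoto_distance({"a", "b", "c"}, {"b", "c", "d"}))   # 0.5
```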
59. Uncritical Use of Euclidean Metric
60. A Naïve Approach
- Compute the distance only after the patterns have been transformed to be as similar to one another as possible
- The computational burden is prohibitive
  - We do not know the proper parameters for the transformation ahead of time
  - More serious if several transformations of each stored prototype are to be considered during classification
61. Tangent Vector
- r transformations are applicable
  - e.g., horizontal translation, shear, and line thinning for hand-written images
- For each prototype x', perform each of the transformations Fi(x'; ai)
- Compute a tangent vector for each transformation
62. Linear Combination of Tangent Vectors
63. Tangent Distance
- Construct an r-by-d matrix T from the tangent vectors at x'
- Tangent distance from x' to x
64. Concept of Tangent Distance
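
A minimal one-sided tangent-distance sketch for slides 62-64, taking T to be the r-by-d matrix whose rows are the tangent vectors at x' (as on slide 63) and solving min over a of ||(x' + Tᵀa) - x|| by least squares; the function name and toy example are illustrative assumptions.

```python
import numpy as np

def tangent_distance(x_prime, x, T):
    """One-sided tangent distance.
    T is an r-by-d matrix whose rows are the tangent vectors at x_prime.
    Find coefficients a minimizing ||(x_prime + T^T a) - x|| by least squares."""
    x_prime = np.asarray(x_prime, dtype=float)
    x = np.asarray(x, dtype=float)
    T = np.atleast_2d(np.asarray(T, dtype=float))
    a, *_ = np.linalg.lstsq(T.T, x - x_prime, rcond=None)  # best combination of tangent vectors
    residual = x_prime + T.T @ a - x
    return np.linalg.norm(residual)

# Hypothetical usage: one tangent vector (e.g., a translation direction).
x_prime = np.array([1.0, 0.0, 0.0])
T = np.array([[0.0, 1.0, 0.0]])           # r = 1 tangent vector of dimension d = 3
x = np.array([1.0, 0.5, 0.2])
print(tangent_distance(x_prime, x, T))    # 0.2: the translation explains the 0.5 offset
```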
65. Category Membership Functions in Fuzzy Logic
(Figure: membership functions for "dark", "medium-dark", "medium-light", and "light")
66. Conjunction Rule and Discriminant Function
67. Cox-Jaynes Axioms for Category Membership Functions
68. Contributions and Limitations
- Guides the steps by which one takes knowledge in linguistic form and casts it into discriminant functions
- Does not rely on data
69. Reduced Coulomb Energy (RCE) Networks
70. RCE Training
71. RCE Training
72. RCE Classification
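
As a rough sketch of the RCE scheme outlined on slides 69-72: during training each prototype receives the largest radius, capped at λmax, that keeps prototypes of other classes outside its sphere; at classification time the test point collects the labels of all spheres containing it. The names, the cap value, and the toy data below are illustrative assumptions, not the lecture's code.

```python
import numpy as np

def rce_train(points, labels, lambda_max=0.5):
    """RCE training: each prototype's radius is the distance to the nearest
    prototype of a different class, capped at lambda_max."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    radii = np.empty(len(points))
    for j, p in enumerate(points):
        other = points[labels != labels[j]]               # prototypes of the other classes
        d = np.linalg.norm(other - p, axis=1).min()
        radii[j] = min(d, lambda_max)
    return points, labels, radii

def rce_classify(x, points, labels, radii):
    """RCE classification: collect the labels of all spheres containing x;
    return the label if they agree, otherwise report ambiguity."""
    d = np.linalg.norm(points - np.asarray(x, dtype=float), axis=1)
    covering = set(labels[d <= radii])
    if len(covering) == 1:
        return covering.pop()
    return "ambiguous" if covering else "no decision"

# Hypothetical usage with a tiny two-class set.
pts = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9]])
labs = np.array([0, 0, 1, 1])
model = rce_train(pts, labs)
print(rce_classify([0.1, 0.0], *model))    # expected: 0
```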
73. Approximations by Series Expansions
74. One-Dimensional Example