Friday, February 2, 2001 - PowerPoint PPT Presentation

Description:

Description: CIS 830: Advanced Topics in Artificial Intelligence, KSU. Friday, February 2, 2001.


Transcript and Presenter's Notes

Title: Friday, February 2, 2001


1
Presentation
Aspects Of Feature Selection for KDD
Friday, February 2, 2001
Presenter: Ajay Gavade
Paper 2: Liu and Motoda, Chapter 3
2
Outline
  • Categories of Feature Selection Algorithms
  • Feature Ranking Algorithms
  • Minimum Subset Algorithms
  • Basic Feature Generation Schemes
  • How do we generate subsets?
  • Forward, backward, bidirectional, random
  • Search Strategies
  • How do we systematically search for a good
    subset?
  • Informed and Uninformed Search
  • Complete search
  • Heuristic search
  • Nondeterministic search
  • Evaluation Measures
  • How do we tell how good a candidate subset is?
  • Information gain, entropy.

3
The Major Aspects Of Feature Selection
  • Search Directions (Feature Subset Generation)
  • Search Strategies
  • Evaluation Measures
  • A particular feature selection method is a
    combination of one choice along each of these
    aspects; hence each method can be represented as
    a point in a 3-D structure.

4
Major Categories of Feature Selection Algorithms
(From the Point of View of the Method's Output)
  • Feature Ranking Algorithms
  • These algorithms return a ranked list of features,
    ordered according to some evaluation measure. The
    ranking tells how important (relevant) each
    feature is compared to the others.

5
Major Categories of Feature Selection Algorithms
(From the Point of View of the Method's Output)
  • Minimum Subset Algorithms
  • These algorithms return a minimum feature subset,
    and no distinction is made among the features in
    the subset. These algorithms are used when we
    don't know the number of relevant features.

6
Basic Feature Generation Schemes
  • Sequential Forward Generation (SFG)
  • Starts with the empty set and adds features from
    the original set sequentially. Features are added
    according to relevance.
  • N-step look-ahead form.
  • The one-step look-ahead form is the most commonly
    used scheme because of its good efficiency.
  • A minimum feature subset or a ranked list can be
    obtained (see the sketch below).
  • Can deal with noise in the data.
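A minimal Python sketch of the one-step look-ahead form described above (not part of the original slides); evaluate(subset), which scores a candidate subset (higher is better), and is_good_enough(subset), a stopping criterion, are hypothetical placeholders:

    def sequential_forward_generation(features, evaluate, is_good_enough):
        """One-step look-ahead SFG: start from the empty set and greedily
        add the single feature that most improves the evaluation measure."""
        selected, remaining, ranking = [], list(features), []
        while remaining:
            # Most relevant feature = the one whose addition scores best now.
            best = max(remaining, key=lambda f: evaluate(selected + [f]))
            remaining.remove(best)
            selected.append(best)
            ranking.append(best)
            if is_good_enough(selected):   # stop early -> (near-)minimum subset
                break
        return selected, ranking           # run to exhaustion -> full ranked list

Stopping early yields a small subset; letting the loop run until no features remain yields the ranked list mentioned in the bullet above.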

7
Basic Feature Generation Schemes
  • Sequential Backward Generation (SBG)
  • Starts with the full set and removes one feature
    at a time sequentially. The least relevant
    feature is removed at each step (a sketch follows
    below).
  • This tells nothing about the ranking of the
    relevant features that remain.
  • Doesn't guarantee an absolutely minimal subset.
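A matching sketch for SBG under the same assumptions; still_valid(subset) is a hypothetical predicate (for example, the evaluation measure has not dropped below a threshold):

    def sequential_backward_generation(features, evaluate, still_valid):
        """SBG: start from the full set and repeatedly drop the least relevant
        feature, i.e. the one whose removal hurts the evaluation measure least."""
        selected = list(features)
        while len(selected) > 1:
            drop = max(selected, key=lambda f: evaluate([g for g in selected if g != f]))
            reduced = [g for g in selected if g != drop]
            if not still_valid(reduced):   # stop before the subset becomes invalid
                break
            selected = reduced
        return selected                    # a small subset, but not a ranking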

8
Basic Feature Generation Schemes
  • Bidirectional Generation
  • This runs SFG and SBG in parallel, and stops when
    one of them finds a satisfactory subset (see the
    interleaved sketch below).
  • Optimizes for speed when the number of relevant
    features is unknown.
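One possible way to realize the parallel run, sketched here by interleaving one SFG step and one SBG step per iteration; the scheduling choice and the is_satisfactory(subset) predicate are illustrative assumptions, not the book's exact procedure:

    def bidirectional_generation(features, evaluate, is_satisfactory):
        """Interleave one SFG step and one SBG step; stop as soon as either
        direction produces a satisfactory subset."""
        forward, remaining = [], list(features)
        backward = list(features)
        while remaining and len(backward) > 1:
            # SFG step: add the most relevant unused feature.
            add = max(remaining, key=lambda f: evaluate(forward + [f]))
            forward.append(add)
            remaining.remove(add)
            if is_satisfactory(forward):
                return forward
            # SBG step: drop the least relevant remaining feature.
            drop = max(backward, key=lambda f: evaluate([g for g in backward if g != f]))
            backward.remove(drop)
            if is_satisfactory(backward):
                return backward
        return forward if evaluate(forward) >= evaluate(backward) else backward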

9
Basic Feature Generation Schemes
  • Random Generation
  • Sequential generation algorithms are fast on
    average, but they can't guarantee the absolute
    minimum valid set, i.e. the optimal feature
    subset: if they hit a local minimum (the best
    subset at the moment), they have no way to get
    out.
  • The random generation scheme produces subsets at
    random (see the sketch below). A good random
    number generator is required so that ideally
    every combination of features has a chance to
    occur, and occurs just once.
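A minimal sketch of drawing one candidate subset at random; the seed and feature names are illustrative:

    import random

    def random_subset(features, rng):
        """Draw one candidate subset at random: each feature is included
        independently with probability 0.5, so all 2^N combinations can occur."""
        return [f for f in features if rng.random() < 0.5]

    rng = random.Random(2001)          # a good, seeded random number generator
    print(random_subset(["a", "b", "c"], rng))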

10
Search Strategies
  • Exhaustive Search
  • Exhaustive search is complete since it covers all
    combinations of features; but a complete search
    need not be exhaustive.
  • Depth-First Search
  • This search goes down one branch entirely, and
    then backtracks to another branch. It uses a
    stack data structure, explicit or implicit (see
    the sketch below).
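A small sketch of the stack-based traversal for three features (not from the slides); the particular enumeration order shown is just one possible depth-first order:

    def depth_first_subsets(features):
        """Enumerate all non-empty subsets depth-first with an explicit stack."""
        stack = [([], 0)]              # (partial subset, index of next feature)
        while stack:
            subset, i = stack.pop()
            if subset:
                yield subset
            # Children: extend the subset with each feature that comes later.
            for j in range(len(features) - 1, i - 1, -1):
                stack.append((subset + [features[j]], j + 1))

    print(list(depth_first_subsets(["a", "b", "c"])))
    # [['a'], ['a', 'b'], ['a', 'b', 'c'], ['a', 'c'], ['b'], ['b', 'c'], ['c']]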

11
Depth-First Search
Depth-First Search: 3 features a, b, c
[Figure: depth-first search tree over the feature subsets {a}, {b}, {c}, {a,b}, {a,c}, {b,c}, {a,b,c}]
12
Search Strategies
  • Breadth-First Search
  • This search moves down layer by layer, checking
    all subsets with one feature, then with two
    features, and so on. It uses a queue data
    structure (see the sketch below).
  • Its space complexity makes it impractical in most
    cases.
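The corresponding queue-based sketch, which visits subsets layer by layer; again the helper is illustrative, not from the slides:

    from collections import deque

    def breadth_first_subsets(features):
        """Enumerate subsets layer by layer (size 1, then size 2, ...) with a queue."""
        queue = deque(([f], i + 1) for i, f in enumerate(features))
        while queue:
            subset, i = queue.popleft()
            yield subset
            for j in range(i, len(features)):
                queue.append((subset + [features[j]], j + 1))

    print(list(breadth_first_subsets(["a", "b", "c"])))
    # [['a'], ['b'], ['c'], ['a', 'b'], ['a', 'c'], ['b', 'c'], ['a', 'b', 'c']]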

13
Breadth-First Search
Breadth-First Search: 3 features a, b, c
[Figure: breadth-first search over the feature subsets, layer by layer: {a}, {b}, {c}; then {a,b}, {a,c}, {b,c}; then {a,b,c}]
14
Search Strategies
  • Complete Search
  • Branch & Bound Search
  • It is a variation of depth-first search, but
    branches whose evaluation is already worse than
    the current bound are pruned, so it need not be
    exhaustive.
  • If the evaluation measure is monotonic, this
    search is complete and guarantees the optimal
    subset (see the sketch below).
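A sketch of branch & bound, assuming the classic formulation that seeks the best subset of a pre-specified size under a monotonic cost (lower is better, and removing a feature never decreases cost); cost() and target_size are illustrative assumptions, not taken from the slides:

    def branch_and_bound(features, cost, target_size):
        """Search the 'remove one feature at a time' tree for the best subset
        of a given size; any branch already at or above the bound beta is pruned."""
        best = {"subset": tuple(features), "beta": float("inf")}

        def expand(subset, start):
            c = cost(subset)
            if c >= best["beta"]:          # prune: no descendant can do better
                return
            if len(subset) == target_size:
                best["subset"], best["beta"] = subset, c
                return
            for i in range(start, len(subset)):
                expand(subset[:i] + subset[i + 1:], i)   # child: drop the i-th feature

        expand(tuple(features), 0)
        return best["subset"], best["beta"]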

15
Branch & Bound Search
Branch & Bound Search: 3 features a, b, c, with bound beta
[Figure: search tree annotated with evaluation values; as preserved in the transcript, {a,b,c} carries 11, the two-feature subsets {a,b}, {a,c}, {b,c} carry 12, 15, 13, the single features {a}, {b}, {c} carry 17, 17, 9, and the root carries 1000]
16
Heuristic Search
  • Quick To Find Solution (Subset of Features)
  • Finds Near Optimal Solution
  • More Speed With Little Loss of Optimality
  • Best-First Search
  • This is derived from breadth-first search. It
    expands its search space layer by layer, and
    chooses one best subset at each layer to expand
    (see the sketch below).
  • Beam Search
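A sketch that follows the slide's description literally (one best subset expanded per layer); evaluate() is an assumed higher-is-better measure. A beam search would keep the k best subsets per layer instead of just one:

    def best_first_selection(features, evaluate):
        """Layer-by-layer sketch: expand only the single best subset at each layer."""
        current, best_subset, best_score = [], [], float("-inf")
        while len(current) < len(features):
            # Expand the current subset by every unused feature, keep the best child.
            children = [current + [f] for f in features if f not in current]
            current = max(children, key=evaluate)
            score = evaluate(current)
            if score > best_score:
                best_subset, best_score = list(current), score
        return best_subset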

17
Best-First Search
Best-First Search: 3 features a, b, c
[Figure: search lattice annotated with evaluation values; as preserved in the transcript, the root carries 1000, {a}, {b}, {c} carry 17, 19, 18, {a,b}, {a,c}, {b,c} carry 12, 13, 10, and {a,b,c} carries 20]
18
Search Strategies
  • Approximate Branch & Bound Search
  • This is an extension of branch & bound search.
  • Here the bound is relaxed by some amount, which
    allows the algorithm to continue past the strict
    bound and still reach an optimal subset. By
    changing the relaxation amount, the monotonicity
    of the measure can be observed (see the note
    below).
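Relative to the branch & bound sketch above, only the pruning test changes; an additive slack delta is one illustrative way to relax the bound (a multiplicative relaxation would work similarly), and delta = 0 recovers plain branch & bound:

    def approximate_branch_and_bound(features, cost, target_size, delta):
        """Same search as branch_and_bound above, but prune against beta + delta,
        so near-bound branches are still explored."""
        best = {"subset": tuple(features), "beta": float("inf")}

        def expand(subset, start):
            c = cost(subset)
            if c >= best["beta"] + delta:          # relaxed pruning test
                return
            if len(subset) == target_size:
                if c < best["beta"]:
                    best["subset"], best["beta"] = subset, c
                return
            for i in range(start, len(subset)):
                expand(subset[:i] + subset[i + 1:], i)

        expand(tuple(features), 0)
        return best["subset"], best["beta"]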

19
Approximate Branch & Bound Search
Approximate Branch & Bound Search: 3 features a, b, c
[Figure: search tree annotated with evaluation values; as preserved in the transcript, {a,b,c} carries 11, the two-feature subsets {a,b}, {a,c}, {b,c} carry 13, 15, 12, the single features {a}, {b}, {c} carry 9, 17, 17, and the root carries 1000]
20
Nondeterministic Search
  • Avoid Getting Stuck in Local Minima
  • Capture The Interdependence of Features
  • RAND
  • It keeps only the current best subset (see the
    sketch below).
  • If a sufficiently long running period is allowed
    and a good random function is used, it can find
    the optimal subset. The problem with this
    algorithm is that we don't know when we have
    reached the optimal subset; hence the stopping
    condition is the maximum number of loops allowed.
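A keep-the-best skeleton of RAND with the stopping condition expressed as a maximum number of loops; evaluate() (higher is better), the seed, and the loop count are illustrative assumptions:

    import random

    def rand_search(features, evaluate, max_loops=10000, seed=830):
        """RAND sketch: draw random subsets and keep only the current best;
        stop after a fixed maximum number of loops, since there is no way to
        tell when the optimal subset has been reached."""
        rng = random.Random(seed)
        best_subset = list(features)
        best_score = evaluate(best_subset)
        for _ in range(max_loops):
            subset = [f for f in features if rng.random() < 0.5]
            score = evaluate(subset)
            if score > best_score:     # keep only the current best subset
                best_subset, best_score = subset, score
        return best_subset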

21
Evaluation Measures
  • What is Entropy?
  • A Measure of Uncertainty
  • The Quantity
  • Purity: how close a set of instances is to having
    just one label
  • Impurity (disorder): how close it is to total
    uncertainty over labels
  • The Measure: Entropy
  • Directly proportional to impurity, uncertainty,
    irregularity, surprise
  • Inversely proportional to purity, certainty,
    regularity, redundancy
  • Example
  • For simplicity, assume H = {0, 1}, distributed
    according to Pr(y)
  • Can have (more than 2) discrete class labels
  • Continuous random variables: differential entropy
  • Optimal purity for y: either
  • Pr(y = 0) = 1, Pr(y = 1) = 0, or
  • Pr(y = 1) = 1, Pr(y = 0) = 0
  • Entropy is 0 if all members of S belong to the
    same class.

22
Entropy: Information-Theoretic Definition
  • Components
  • D: a set of examples {<x1, c(x1)>, <x2, c(x2)>,
    …, <xm, c(xm)>}
  • p+ ≡ Pr(c(x) = +), p- ≡ Pr(c(x) = -)
  • Definition
  • H is defined over a probability density function
    p
  • D contains examples whose frequency of + and -
    labels indicates p+ and p- for the observed data
  • The entropy of D relative to c is
    H(D) ≡ -p+ log_b(p+) - p- log_b(p-)
  • If a target attribute can take on c different
    values, the entropy of S relative to this c-wise
    classification is defined as
    H(S) ≡ Σ_{i=1..c} -p_i log_b(p_i),
    where p_i is the proportion of S belonging to
    class i (a short sketch in code follows).
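A short Python rendering of these definitions; the 9-versus-5 class split in the example is illustrative and not taken from the slides:

    from math import log

    def entropy(probabilities, base=2):
        """H = -sum over classes of p_i * log_b(p_i); terms with p_i in {0, 1}
        contribute nothing and are skipped."""
        return -sum(p * log(p, base) for p in probabilities if 0 < p < 1)

    def entropy_of_labels(labels, base=2):
        """Estimate the class proportions from a list of labels and return H(D)."""
        counts = {}
        for c in labels:
            counts[c] = counts.get(c, 0) + 1
        return entropy([n / len(labels) for n in counts.values()], base)

    # Illustrative D with 9 '+' and 5 '-' examples:
    print(round(entropy_of_labels(["+"] * 9 + ["-"] * 5), 3))   # 0.94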

23
Entropy
  • What is the least pure probability distribution?
  • Pr(y = 0) = 0.5, Pr(y = 1) = 0.5
  • Corresponds to maximum impurity / uncertainty /
    irregularity / surprise
  • Property of entropy: it is a concave function
    (concave downward)
  • Entropy is 1 when S contains an equal number of
    positive and negative examples.
  • Entropy specifies the minimum number of bits of
    information needed to encode the classification
    of an arbitrary member of S.
  • What Units is H Measured In?
  • Depends on the base b of the log (bits for b = 2,
    nats for b = e, etc.)
  • A single bit is required to encode each example
    in the worst case (p = 0.5)
  • If there is less uncertainty (e.g., p = 0.8), we
    can use less than 1 bit each (see the numeric
    check below)
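A quick numeric check of these claims (p = 0.5 costs a full bit, p = 0.8 costs less, a pure set costs nothing), using the same entropy helper as in the previous sketch:

    from math import log

    def entropy(probabilities, base=2):
        # Same helper as in the previous slide's sketch.
        return -sum(p * log(p, base) for p in probabilities if 0 < p < 1)

    print(entropy([0.5, 0.5]))             # 1.0 -> worst case: one full bit per example
    print(round(entropy([0.8, 0.2]), 3))   # 0.722 -> less uncertainty, under 1 bit each
    print(entropy([1.0, 0.0]))             # 0 -> a pure set has zero entropy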

24
Information Gain
  • It is a measure of the effectiveness of an
    attribute in classifying the training data.
  • Measures the expected reduction in Entropy caused
    by partitioning the examples according to the
    attribute.
  • Measures the uncertainty removed by splitting on
    the value of attribute A.
  • The information gain Gain(S, A) of an attribute A,
    relative to a collection of examples S, is
    Gain(S, A) ≡ H(S) - Σ_{v in values(A)} (|S_v| / |S|) H(S_v),
    where values(A) is the set of all possible values
    of A and S_v is the subset of S for which A has
    value v (see the sketch below).
  • Gain(S,A) is the information provided about the
    target function value, given the value of some
    attribute A.
  • The value of Gain(S,A) is the number of bits
    saved when encoding the target value of an
    arbitrary member of S, by knowing the value of
    attribute A.
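A sketch of the gain computation; the 14-example data set and the attribute values "low"/"high" are illustrative, not taken from the slides:

    from math import log

    def entropy(probabilities, base=2):
        return -sum(p * log(p, base) for p in probabilities if 0 < p < 1)

    def entropy_of_labels(labels, base=2):
        counts = {}
        for c in labels:
            counts[c] = counts.get(c, 0) + 1
        return entropy([n / len(labels) for n in counts.values()], base)

    def information_gain(labels, attribute_values, base=2):
        """Gain(S, A) = H(S) - sum over v in values(A) of |S_v|/|S| * H(S_v)."""
        gain = entropy_of_labels(labels, base)
        for v in set(attribute_values):
            labels_v = [c for a, c in zip(attribute_values, labels) if a == v]
            gain -= len(labels_v) / len(labels) * entropy_of_labels(labels_v, base)
        return gain

    # 14 examples, 9 '+' and 5 '-'; a two-valued attribute splits them
    # into (6+, 2-) for "low" and (3+, 3-) for "high".
    labels = ["+"] * 6 + ["-"] * 2 + ["+"] * 3 + ["-"] * 3
    attr   = ["low"] * 8 + ["high"] * 6
    print(round(information_gain(labels, attr), 3))   # 0.048 bits saved by knowing A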

25
An Illustrative Example
26
Attributes with Many Values
27
Summary Points
  • Search & Measure
  • Search and measure play the dominant roles in
    feature selection.
  • Stopping criteria are usually determined by a
    particular combination of search and measure.
  • There are different feature selection methods
    with different combinations of search strategies
    and evaluation measures.
  • Heuristic search, inductive bias, and inductive
    generalization
  • Entropy and Information Gain
  • Goal: to measure the uncertainty removed by
    splitting on a candidate attribute A
  • Calculating information gain (change in entropy)
  • Using information gain in the construction of a
    decision tree