1
Decision List
  • LING 572
  • Fei Xia
  • 1/12/06

2
Outline
  • Basic concepts and properties
  • Case study

3
Definitions
  • A decision list (DL) is an ordered list of
    conjunctive rules.
  • Rules can overlap, so the order is important.
  • k-DL: a DL in which the length of every rule is
    at most k.
  • A decision list determines an example's class by
    using the first matched rule.

4
An example
  • A simple DL
  • If X1 = v11 and X2 = v21 then c1
  • If X2 = v21 and X3 = v34 then c2
  • Classify an example (v11, v21, v34): both rules
    match, so the first one applies and the class is c1.
  • The DL is a 2-DL.
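
The example above can be sketched in a few lines of Python. This is only an illustration: the dictionary-based rule representation and the `classify` helper are assumptions made here, not something the slides define.

```python
from typing import Dict, List, Optional, Tuple

# A rule is (conditions, label); conditions maps an attribute name to the
# value it must take. This representation is an assumption for illustration.
Rule = Tuple[Dict[str, str], str]


def classify(decision_list: List[Rule], example: Dict[str, str]) -> Optional[str]:
    """Return the label of the first rule whose conditions all hold."""
    for conditions, label in decision_list:
        if all(example.get(attr) == value for attr, value in conditions.items()):
            return label
    return None  # no rule matched; a full DL would end with a default rule


# The 2-DL from the slide: every rule tests at most two attributes.
dl = [
    ({"X1": "v11", "X2": "v21"}, "c1"),
    ({"X2": "v21", "X3": "v34"}, "c2"),
]
example = {"X1": "v11", "X2": "v21", "X3": "v34"}
result = classify(dl, example)  # both rules match, the first one wins
```

Reversing the list changes the answer for the same example, which is why the order of overlapping rules matters.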

5
Rivest's paper
  • It assumes that all attributes (including the goal
    attribute) are binary.
  • It shows that DLs are easily learnable from examples.

6
Assignment and formula
  • Input attributes: x1, ..., xn
  • An assignment gives each input attribute a value
    (1 or 0), e.g., 10001
  • A boolean formula (function) maps each assignment
    to a value (1 or 0)

7
  • Two formulae are equivalent if they give the same
    value for the same input.
  • Total number of different formulae: $2^{2^n}$
  • ⇒ Classification problem: learn a formula given a
    partial table
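
A small enumeration makes the counting concrete. With n inputs there are 2^n assignments, so there are 2^(2^n) distinct truth tables. The sketch below lists them for n = 2 and then filters by a partial table; the two labelled rows in `partial` are hypothetical, chosen only for illustration.

```python
from itertools import product


def all_boolean_functions(n):
    """Yield every boolean function of n inputs as a dict: assignment -> value."""
    assignments = list(product((0, 1), repeat=n))
    for outputs in product((0, 1), repeat=len(assignments)):
        yield dict(zip(assignments, outputs))


n = 2
functions = list(all_boolean_functions(n))

# A partial table (hypothetical): only two of the four rows are known.
partial = {(0, 0): 0, (1, 1): 1}

# Functions that agree with every known row of the partial table.
consistent = [f for f in functions if all(f[x] == y for x, y in partial.items())]
```

Each unknown row doubles the number of functions that still fit, so the partial table alone does not pin down one formula.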

8
CNF and DNF
  • Literal: a variable or its negation
  • Term: conjunction (and) of literals
  • Clause: disjunction (or) of literals
  • CNF (conjunctive normal form): the conjunction of
    clauses.
  • DNF (disjunctive normal form): the disjunction of
    terms.
  • k-CNF and k-DNF: each clause (term) contains at
    most k literals.

9
A slightly different definition of DT
  • A decision tree (DT) is a binary tree where each
    internal node is labeled with a variable, and
    each leaf is labeled with 0 or 1.
  • k-DT: a DT whose depth is at most k.
  • A DT defines a boolean formula: look at the paths
    whose leaf node is 1.
  • An example

10
Decision list
  • A decision list is a list of pairs
  • (f1, v1), ..., (fr, vr),
  • where the fi are terms, and fr = true.
  • A decision list defines a boolean function:
  • given an assignment x, DL(x) = vj, where j is
    the least index s.t. fj(x) = 1.

11
Relations among different representations
  • CNF, DNF, DT, DL
  • k-CNF, k-DNF, k-DT, k-DL
  • For any k < n, k-DL is a proper superset of the
    other three.
  • Compared to DT, DL has a simpler structure, but
    the complexity of the decisions allowed at each
    node is greater.

12
k-CNF and k-DNF are proper subsets of k-DL
  • k-DNF is a subset of k-DL
  • Each term t of a DNF is converted into a decision
    rule (t, 1).
  • k-CNF is a subset of k-DL
  • Every k-CNF is the complement of a k-DNF: k-CNF and
    k-DNF are duals of each other.
  • The complement of a k-DL is also a k-DL.
  • Neither k-CNF nor k-DNF is a subset of the other
  • Ex: 1-DNF
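
The two constructions on this slide can be checked by brute force. The sketch below turns each DNF term t into a rule (t, 1), appends a default rule (true, 0), and builds the complement by flipping every output value. The dictionary encoding of terms and the sample 2-DNF are assumptions made for illustration.

```python
from itertools import product
from typing import Dict, List, Tuple

Term = Dict[int, int]                  # variable index -> required value; {} means "true"
DecisionList = List[Tuple[Term, int]]  # ordered (term, output value) pairs


def holds(term: Term, x: Tuple[int, ...]) -> bool:
    return all(x[i] == v for i, v in term.items())


def dl_eval(dl: DecisionList, x: Tuple[int, ...]) -> int:
    """Output of the first rule whose term is satisfied by x."""
    for term, value in dl:
        if holds(term, x):
            return value
    raise ValueError("the last term of a decision list must be 'true'")


def dnf_eval(terms: List[Term], x: Tuple[int, ...]) -> int:
    return int(any(holds(t, x) for t in terms))


def dnf_to_dl(terms: List[Term]) -> DecisionList:
    # Each term t becomes the rule (t, 1); everything else falls through to 0.
    return [(t, 1) for t in terms] + [({}, 0)]


def complement(dl: DecisionList) -> DecisionList:
    # Same terms, same order, flipped outputs: still a k-DL.
    return [(t, 1 - v) for t, v in dl]


# Hypothetical 2-DNF over three variables: (x0 and not x1) or (x1 and x2)
dnf = [{0: 1, 1: 0}, {1: 1, 2: 1}]
dl = dnf_to_dl(dnf)
assignments = list(product((0, 1), repeat=3))
```

Since the complement of a k-DNF is a k-CNF, the same two steps also cover the k-CNF case.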

13
k-DT is a proper subset of k-DL
  • k-DT is a subset of k-DNF
  • Each leaf labeled with 1 maps to a term in
    k-DNF.
  • k-DT is a subset of k-CNF
  • Each leaf labeled with 0 maps to a clause in
    k-CNF
  • ⇒ k-DT is a subset of k-DNF ∩ k-CNF

14
k-DT, k-CNF, k-DNF and k-DL
(Venn diagram: k-DT lies inside the overlap of k-CNF and k-DNF, and all three lie inside k-DL.)
15
Learnability
  • Positive examples vs. negative examples of the
    concept being learned.
  • In some domains, positive examples are easier to
    collect.
  • A sample is a set of examples.
  • A boolean function is consistent with a sample if
    it does not contradict any example in the sample.

16
Two properties of a learning algorithm
  • A learning algorithm is economical if it requires
    few examples to identify the correct concept.
  • A learning algorithm is efficient if it requires
    little computational effort to identify the
    correct concept.
  • ⇒ We prefer algorithms that are both economical
    and efficient.

17
Hypothesis space
  • Hypothesis space F: a set of concepts that are
    being considered.
  • Hopefully, the concept being learned should be in
    the hypothesis space of a learning algorithm.
  • The goal of a learning algorithm is to select the
    right concept from F given the training data.

18
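  • Discrepancy between two functions f and g:
    $\mathrm{discrepancy}(f, g) = \sum_{x : f(x) \neq g(x)} P_n(x)$,
    where $P_n$ is the distribution from which examples
    are drawn.
  • Ideally, we want the discrepancy to be as small as
    possible: at most an accuracy parameter $\epsilon$.
  • To deal with bad luck in drawing examples
    according to $P_n$, we define a confidence
    parameter $\delta$.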

19
Polynomially learnable
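  • A set of Boolean functions F is polynomially
    learnable if there exists an algorithm A and a
    polynomial function $m(n, 1/\epsilon, 1/\delta)$ such
    that, for every accuracy parameter $\epsilon$,
    confidence parameter $\delta$, distribution $P_n$ and
    target $f \in F$,
  • when given a sample of f of size
    $m(n, 1/\epsilon, 1/\delta)$
  • drawn according to $P_n$, A will with probability
    at least $1 - \delta$ output a $g \in F$
    s.t. $\mathrm{discrepancy}(f, g) \le \epsilon$.
  • Furthermore, A's running time is polynomially
    bounded in n and m.
  • k-DL is polynomially learnable.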
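
For intuition about the sample size, the standard bound for a learner that always outputs a hypothesis consistent with the sample is worth writing out. It is the general result for finite hypothesis spaces, not a formula taken from these slides, and $|F_n|$ (the number of hypotheses over n attributes) is notation assumed here.

```latex
m \;\ge\; \frac{1}{\epsilon}\left(\ln |F_n| + \ln \frac{1}{\delta}\right)
```

The bound depends on the hypothesis space only through $\ln |F_n|$, so polynomially many examples suffice whenever $\ln |F_n|$ grows polynomially in n.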

20
How to build a decision list
  • Decision tree → Decision list
  • Greedy, iterative algorithm that builds DLs
    directly.

21
The algorithm in (Rivest, 1987)
  • If the example set S is empty, halt.
  • Examine each term of length at most k until a term
    t is found s.t. all examples in S which make t
    true are of the same type v.
  • Add (t, v) to the decision list and remove those
    examples from S.
  • Repeat the three steps above.
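
The procedure above can be sketched as follows for binary attributes. This is a minimal reading of the steps, not Rivest's own code. The term encoding, the function names and the XOR sample are assumptions, and the sketch additionally requires a chosen term to cover at least one remaining example so that every iteration makes progress.

```python
from itertools import combinations, product


def terms_up_to(n, k):
    """Yield every conjunction of at most k literals over n variables.
    A term is a dict: variable index -> required value; {} is 'true'."""
    for size in range(k + 1):
        for variables in combinations(range(n), size):
            for values in product((0, 1), repeat=size):
                yield dict(zip(variables, values))


def holds(term, x):
    return all(x[i] == v for i, v in term.items())


def learn_k_dl(examples, n, k):
    """examples: list of (assignment, label). Returns a list of (term, label)
    rules, or None if no term of length <= k isolates a single class."""
    remaining = list(examples)
    dl = []
    while remaining:                              # step 1: halt when S is empty
        for term in terms_up_to(n, k):            # step 2: search for a pure term
            labels = {y for x, y in remaining if holds(term, x)}
            if len(labels) == 1:                  # covers something, one class only
                dl.append((term, labels.pop()))   # step 3: add (t, v) ...
                remaining = [(x, y) for x, y in remaining if not holds(term, x)]
                break                             # ... and remove covered examples
        else:
            return None                           # no suitable term exists
    return dl


def dl_eval(dl, x):
    return next(v for t, v in dl if holds(t, x))


# Toy sample: XOR of two variables, which needs terms of length 2.
xor = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
learned = learn_k_dl(xor, n=2, k=2)
```

On this sample the search fails for k = 1 and succeeds for k = 2, and the learned list reproduces every training label.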

22
The general greedy algorithm
  • RuleList = [], E = training_data
  • Repeat until E is empty or gain is small
  • f = Find_best_feature(E)
  • Let E' be the examples covered by f
  • Let c be the most common class in E'
  • Add (f, c) to RuleList
  • E = E - E'
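
A runnable sketch of this loop is given below. The slides leave `Find_best_feature` and the stopping test open, so two assumptions are made here: the best feature is the one whose covered examples have the lowest class entropy (ties go to larger coverage), and "gain is small" is read as the entropy exceeding a hypothetical `max_entropy` threshold. The toy data is invented for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Class entropy (in bits) of a non-empty list of labels."""
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())


def find_best_feature(examples):
    """Assumed quality measure: lowest entropy first, then largest coverage."""
    best, best_key = None, None
    for f in sorted({f for feats, _ in examples for f in feats}):
        covered = [y for feats, y in examples if f in feats]
        key = (entropy(covered), -len(covered))
        if best_key is None or key < best_key:
            best, best_key = f, key
    return best


def greedy_decision_list(training_data, max_entropy=0.5):
    """training_data: list of (set of features present, class label)."""
    rule_list = []
    remaining = list(training_data)                      # E = training_data
    while remaining:                                     # until E is empty ...
        f = find_best_feature(remaining)
        if f is None:
            break                                        # no features left to try
        covered = [y for feats, y in remaining if f in feats]   # E'
        if entropy(covered) > max_entropy:
            break                                        # ... or gain is small
        c = Counter(covered).most_common(1)[0][0]        # most common class in E'
        rule_list.append((f, c))
        remaining = [(feats, y) for feats, y in remaining if f not in feats]  # E = E - E'
    return rule_list


# Toy data (hypothetical): feature names are strings, classes are c1 / c2.
data = [({"a", "b"}, "c1"), ({"a"}, "c1"), ({"b", "c"}, "c2"), ({"c"}, "c2")]
rules = greedy_decision_list(data)
```

Feature "b" is never chosen because its covered examples are split evenly between the two classes.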

23
Problem of greedy algorithm
  • The interpretation of rules depends on preceding
    rules.
  • Each iteration reduces the number of training
    examples.
  • Poor rule choices at the beginning of the list
    can significantly reduce the accuracy of the
    learned DL.
  • ⇒ Several papers propose alternative algorithms

24
Summary of (Rivest, 1987)
  • Formal definition of DL
  • Show the relation between k-DL, k-CNF, k-DNF and
    k-DT.
  • Prove that k-DL is polynomially learnable.
  • Give a simple greedy algorithm to build k-DL.

25
Outline
  • Basic concepts and properties
  • Case study

26
In practice
  • Input attributes and the goal are not necessarily
    binary.
  • Ex: the previous word
  • A term → a feature (it is not necessarily a
    conjunction of literals)
  • Ex: the word appears in a k-word window
  • Only some feature types are considered, instead
    of all possible features
  • Ex: previous word and next word
  • Greedy algorithm: quality measure
  • Ex: a feature with minimum entropy

27
Case study: accent restoration
  • Task: to restore accents in Spanish and French
  • ⇒ A special case of WSD
  • Ex: ambiguous de-accented forms
  • cesse → cesse, cessé
  • cote → côté, côte, cote, coté
  • Algorithm: build a DL for each ambiguous
    de-accented form, e.g., one for cesse, another
    one for cote
  • Attributes: words within a window

28
The algorithm
  • Training:
  • Find the list of de-accented forms that are
    ambiguous.
  • For each ambiguous form, build a decision list.
  • Testing: check each word in a sentence
  • if it is ambiguous,
  • then restore the accented form according to the DL

29
Step 1: Identify forms that are ambiguous
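
One way to sketch this step is to strip accents from every word in a vocabulary and keep the de-accented forms that more than one accented form maps to. The helper names and the small vocabulary below are assumptions for illustration; the accent stripping uses Unicode decomposition from the standard library.

```python
import unicodedata
from collections import defaultdict


def strip_accents(word):
    """Decompose characters (NFD) and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def ambiguous_forms(vocabulary):
    """Map each de-accented form to its accented variants; keep the ambiguous ones."""
    by_base = defaultdict(set)
    for word in vocabulary:
        by_base[strip_accents(word)].add(word)
    return {base: forms for base, forms in by_base.items() if len(forms) > 1}


# Hypothetical vocabulary built from the forms mentioned in the case study.
vocabulary = ["côté", "côte", "cote", "coté", "cesse", "cessé", "maison"]
ambiguous = ambiguous_forms(vocabulary)
```

Only the forms returned here need a decision list; unambiguous words such as maison are left alone.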
30
Step 2: Collect training contexts
Context: the previous three and next three
words. Strip the accents from the data. Why?
31
Step 3: Measure collocational distributions
Feature types are pre-defined.
32
Collocations
33
Step 4: Rank decision rules by log-likelihood
There are many alternatives.
word class
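
The exact formula did not survive in this transcript, so the sketch below uses one common choice for two competing forms: the absolute log of the ratio of their smoothed conditional probabilities given the feature. The add-alpha smoothing, the value of `alpha`, the function name and the toy counts are all assumptions.

```python
import math
from collections import Counter, defaultdict


def rank_rules(examples, classes, alpha=0.1):
    """examples: iterable of (features, label); classes: the two competing forms.
    Returns (score, feature, predicted class) triples, strongest evidence first."""
    a, b = classes
    counts = defaultdict(Counter)
    for feats, label in examples:
        for f in feats:
            counts[f][label] += 1
    rules = []
    for f, c in counts.items():
        # Ratio of smoothed P(a | f) to P(b | f); the shared denominator cancels.
        score = abs(math.log((c[a] + alpha) / (c[b] + alpha)))
        rules.append((score, f, a if c[a] >= c[b] else b))
    rules.sort(key=lambda r: (-r[0], r[1]))
    return rules


# Toy counts (hypothetical): f1 always co-occurs with A, f2 mostly with B, f3 is uninformative.
examples = (
    [({"f1"}, "A")] * 4
    + [({"f2"}, "A")] * 1 + [({"f2"}, "B")] * 3
    + [({"f3"}, "A")] * 2 + [({"f3"}, "B")] * 2
)
ranked = rank_rules(examples, classes=("A", "B"))
```

A feature seen equally often with both forms scores zero and sinks to the bottom of the list, which is where pruning would remove it.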
34
Step 5: Pruning DLs
  • Pruning
  • Cross-validation
  • Remove redundant rules: e.g., a domingo rule is
    redundant if the WEEKDAY rule precedes it.

35
Building DL
  • For a de-accented form w, find all possible
    accented forms
  • Collect training contexts
  • collect k words on each side of w
  • strip the accents from the data
  • Measure collocational distributions
  • use pre-defined attribute combinations
  • Ex: -1w, +1w, +2w
  • Rank decision rules by log-likelihood
  • Optional pruning and interpolation

36
Experiments
Prior (baseline): choose the most common form.
37
Global probabilities vs. Residual probabilities
  • Two ways to calculate the log-likelihood
  • Global probabilities: using the full data set
  • Residual probabilities: using the residual
    training data
  • More relevant, but less data and more expensive
    to compute.
  • Interpolation: use both
  • In practice, global probabilities work better.

38
Combining vs. Not combining evidence
  • Each decision is based on a single piece of
    evidence.
  • Run-time efficiency and easy modeling
  • It works well, at least for this task, but why?
  • Combining all available evidence rarely produces
    different results
  • The gross exaggeration of probability from
    combining all of these non-independent
    log-likelihoods is avoided

39
Summary of case study
  • It allows a wider context (compared to n-gram
    methods)
  • It allows the use of multiple, highly
    non-independent evidence types (compared to
    Bayesian methods)
  • ⇒ A kitchen-sink approach of the best kind

40
Advanced topics
41
Probabilistic DL
  • DL: a rule is (f, v)
  • Probabilistic DL: a rule is
  • (f, v1/p1 v2/p2 ... vn/pn)
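
A probabilistic DL can be sketched by letting each rule carry a class distribution instead of a single value. The representation below is an assumption, as is the uniform default rule; the first rule mirrors the "if A and B then (c1, 0.8) (c2, 0.2)" example given in the summary slide.

```python
def predict_distribution(pdl, example):
    """Return the class distribution of the first rule whose features all
    appear in the example (a set of features that are present)."""
    for required, distribution in pdl:
        if required <= example:          # subset test: every required feature fired
            return distribution
    raise ValueError("no rule matched; end the list with a default rule")


pdl = [
    (frozenset({"A", "B"}), {"c1": 0.8, "c2": 0.2}),
    (frozenset(), {"c1": 0.5, "c2": 0.5}),   # hypothetical default rule
]

dist = predict_distribution(pdl, frozenset({"A", "B", "C"}))
best_class = max(dist, key=dist.get)         # most probable class under the matched rule
```

Taking the most probable class of the matched rule recovers an ordinary DL, while the probabilities remain available as a confidence estimate.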

42
Entropy of a feature q
(Diagram: q splits the examples into those where it fired and those where it did not fire.)
43
Algorithms for building DL
  • AQ algorithm (Michalski, 1969)
  • CN2 algorithm (Clark and Niblett, 1989)
  • Segal and Etzioni (1994)
  • Goodman (2002)

44
Summary of decision list
  • Rules are easily understood by humans (but
    remember the order factor)
  • DL tends to be relatively small, and fast and
    easy to apply in practice.
  • DL is related to DT, CNF, DNF, and TBL.
  • Learning: greedy algorithm and other improved
    algorithms
  • Extension: probabilistic DL
  • Ex: if A and B then (c1, 0.8) (c2, 0.2)