Title: Decision List
1. Decision List
2. Outline
- Basic concepts and properties
- Case study
3. Definitions
- A decision list (DL) is an ordered list of conjunctive rules.
- Rules can overlap, so the order is important.
- k-DL: the length of every rule is at most k.
- A decision list determines an example's class by using the first matching rule.
4. An example
- A simple DL:
  - If X1 = v11 ∧ X2 = v21 then c1
  - If X2 = v21 ∧ X3 = v34 then c2
- Classify the example (v11, v21, v34): the first rule matches, so the class is c1.
- This DL is a 2-DL (every rule tests at most two attributes).
5. Rivest's paper
- It assumes that all attributes (including the goal attribute) are binary.
- It shows that DLs are easily learnable from examples.
6. Assignment and formula
- Input attributes: x1, …, xn
- An assignment gives each input attribute a value (1 or 0), e.g., 10001.
- A boolean formula (function) maps each assignment to a value (1 or 0).
7. Assignment and formula (cont.)
- Two formulae are equivalent if they give the same value for the same input.
- Total number of different formulae: 2^(2^n), since there are 2^n assignments.
- ⇒ Classification problem: learn a formula given a partial truth table.
8. CNF and DNF
- Literal: a variable or its negation
- Term: a conjunction (AND) of literals
- Clause: a disjunction (OR) of literals
- CNF (conjunctive normal form): a conjunction of clauses
- DNF (disjunctive normal form): a disjunction of terms
- k-CNF and k-DNF: every clause (resp. term) has at most k literals
9. A slightly different definition of DT
- A decision tree (DT) is a binary tree where each internal node is labeled with a variable, and each leaf is labeled with 0 or 1.
- k-DT: the depth of the DT is at most k.
- A DT defines a boolean formula: take the disjunction of the paths whose leaf is labeled 1.
- An example
10. Decision list
- A decision list is a list of pairs
  - (f1, v1), …, (fr, vr),
  - where the fi are terms and fr = true.
- A decision list defines a boolean function:
  - given an assignment x, DL(x) = vj, where j is the least index s.t. fj(x) = 1.
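A minimal Python sketch of this evaluation rule (the representation is an assumption: a term is a set of attribute/value literals, the last term is the empty conjunction, i.e., true, and the default class c0 is illustrative):

def dl_classify(rules, x):
    # Return v_j for the least j such that f_j(x) = 1.
    for term, value in rules:
        if all(x[attr] == want for attr, want in term):
            return value  # first matching rule wins

# The 2-DL from the earlier example slide:
rules = [
    ({("X1", "v11"), ("X2", "v21")}, "c1"),  # if X1=v11 and X2=v21 then c1
    ({("X2", "v21"), ("X3", "v34")}, "c2"),  # if X2=v21 and X3=v34 then c2
    (set(), "c0"),                           # default rule: f_r = true
]
print(dl_classify(rules, {"X1": "v11", "X2": "v21", "X3": "v34"}))  # -> c1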
11. Relations among different representations
- CNF, DNF, DT, DL
- k-CNF, k-DNF, k-DT, k-DL
- For any k < n, k-DL is a proper superset of the other three.
- Compared to a DT, a DL has a simpler structure, but the decisions allowed at each node are more complex.
12. k-CNF and k-DNF are proper subsets of k-DL
- k-DNF is a subset of k-DL:
  - each term t of the DNF is converted into a decision rule (t, 1) (sketched below).
- k-CNF is a subset of k-DL:
  - every k-CNF is the complement (negation) of a k-DNF; k-CNF and k-DNF are duals of each other;
  - the complement of a k-DL is also a k-DL (flip the output values).
- Neither k-CNF nor k-DNF is a subset of the other.
  - Ex: x1 ∨ x2 is a 1-DNF but not a 1-CNF.
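A Python sketch of the term-to-rule conversion above (representation assumed: a term is a frozenset of (variable, sign) literals, sign True meaning the positive literal):

def dnf_to_dl(dnf_terms, default=0):
    # Each term t of the k-DNF becomes the rule (t, 1). The order of these
    # rules does not matter, since they all output 1; the final default
    # rule (true, 0) catches every assignment no term covers.
    return [(t, 1) for t in dnf_terms] + [(frozenset(), default)]

# Example: the 2-DNF (x1 AND x2) OR (NOT x3) as a 2-DL.
dl = dnf_to_dl([frozenset({("x1", True), ("x2", True)}),
                frozenset({("x3", False)})])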
13. k-DT is a proper subset of k-DL
- k-DT is a subset of k-DNF:
  - each leaf labeled 1 maps to a term of the k-DNF.
- k-DT is a subset of k-CNF:
  - each leaf labeled 0 maps to a clause of the k-CNF.
- ⇒ k-DT is a subset of k-DNF ∩ k-CNF (path enumeration sketched below).
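Both containments amount to enumerating root-to-leaf paths; here is a small Python sketch (the tree encoding is an assumption: a leaf is 0 or 1, an internal node is (var, false_subtree, true_subtree)):

def paths_to_leaves(tree, prefix=(), target=1):
    # Collect the literals along every root-to-leaf path ending in the
    # target label. With target=1 the paths are the terms of a k-DNF;
    # with target=0, negating each path gives the clauses of a k-CNF.
    if tree in (0, 1):
        return [prefix] if tree == target else []
    var, false_sub, true_sub = tree
    return (paths_to_leaves(false_sub, prefix + ((var, 0),), target) +
            paths_to_leaves(true_sub, prefix + ((var, 1),), target))

tree = ("x1", ("x2", 0, 1), 1)           # a depth-2 decision tree
print(paths_to_leaves(tree, target=1))   # -> [(('x1', 0), ('x2', 1)), (('x1', 1),)]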
14. k-DT, k-CNF, k-DNF and k-DL
(Venn diagram: k-DT lies inside k-CNF ∩ k-DNF; k-CNF and k-DNF are each properly contained in k-DL.)
15. Learnability
- Positive examples vs. negative examples of the concept being learned.
  - In some domains, positive examples are easier to collect.
- A sample is a set of examples.
- A boolean function is consistent with a sample if it does not contradict any example in the sample.
16. Two properties of a learning algorithm
- A learning algorithm is economical if it requires few examples to identify the correct concept.
- A learning algorithm is efficient if it requires little computational effort to identify the correct concept.
- ⇒ We prefer algorithms that are both economical and efficient.
17. Hypothesis space
- Hypothesis space F: the set of concepts that are being considered.
- Hopefully, the concept being learned is in the hypothesis space of the learning algorithm.
- The goal of a learning algorithm is to select the right concept from F given the training data.
18. Hypothesis space (cont.)
- Discrepancy between two functions f and g, with respect to the distribution Pn over assignments:
  d(f, g) = Pr_{x ~ Pn}[ f(x) ≠ g(x) ]
- Ideally, we want d(f, g) to be as small as possible: d(f, g) ≤ ε for an error parameter ε.
- To deal with bad luck in drawing examples according to Pn, we also define a confidence parameter δ: the learner may fail with probability at most δ.
19. Polynomially learnable
- A set of boolean functions F is polynomially learnable if there exist an algorithm A and a polynomial function m(n, 1/ε, 1/δ) such that:
  - when given a sample of f of size m = m(n, 1/ε, 1/δ) drawn according to Pn, A will, with probability at least 1 − δ, output a g ∈ F s.t. d(f, g) ≤ ε;
  - furthermore, A's running time is polynomially bounded in n and m.
- k-DL is polynomially learnable.
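The counting argument behind the last claim can be sketched as follows (a standard Occam-style bound; the exact form and constants in Rivest's paper differ):

\[
  m \;\ge\; \frac{1}{\epsilon}\left(\ln|F| + \ln\frac{1}{\delta}\right)
\]

examples suffice for any algorithm that outputs a hypothesis from a finite class F consistent with the sample. For F = k-DL over n variables, the number of terms of length at most k is O(n^k), so

\[
  \ln|k\text{-DL}(n)| \;=\; O\!\left(n^{k}\log n\right) \quad \text{for fixed } k,
\]

which is polynomial in n; hence polynomially many examples suffice, and the greedy algorithm on the next slides finds a consistent k-DL in polynomial time.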
20. How to build a decision list
- Decision tree → decision list
- Greedy, iterative algorithm that builds a DL directly.
21. The algorithm in (Rivest, 1987)
1. If the example set S is empty, halt.
2. Examine each term of length at most k until a term t is found s.t. all examples in S which make t true are of the same class v (sketched below).
3. Add (t, v) to the decision list and remove those examples from S.
4. Repeat steps 1-3.
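A runnable Python sketch of steps 1-3 (representations are assumptions: S is a list of (assignment, label) pairs, an assignment is a dict, and a term is a tuple of (variable, value) literals):

from itertools import combinations, product

def term_true(term, x):
    return all(x[var] == val for var, val in term)

def terms_up_to_k(variables, k):
    # Enumerate every term of length <= k; the empty term is "true".
    for r in range(k + 1):
        for vs in combinations(variables, r):
            for vals in product([0, 1], repeat=r):
                yield tuple(zip(vs, vals))

def learn_k_dl(S, variables, k):
    dl = []
    while S:                                         # step 1: halt when S is empty
        for t in terms_up_to_k(variables, k):        # step 2: scan terms
            covered = [label for x, label in S if term_true(t, x)]
            if covered and len(set(covered)) == 1:   # all covered examples agree
                dl.append((t, covered[0]))           # step 3: emit rule (t, v)
                S = [(x, l) for x, l in S if not term_true(t, x)]
                break
        else:
            return None    # no pure term exists: S is consistent with no k-DL
    return dl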
22. The general greedy algorithm
- RuleList = {}, E = training_data
- Repeat until E is empty or the gain is small (sketched below):
  - f = Find_best_feature(E)
  - Let E' be the examples covered by f
  - Let c be the most common class in E'
  - Add (f, c) to RuleList
  - E = E − E'
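A Python sketch of this loop (Find_best_feature, the quality measure, and the coverage test are assumptions; any measure, such as the minimum-entropy criterion mentioned later, can play the role of "best"):

from collections import Counter

def greedy_dl(E, features, covers, quality, min_gain=1e-6):
    # E: list of (example, class); covers(f, x) -> bool; quality(f, E) -> float.
    rule_list = []
    while E:
        f = max(features, key=lambda g: quality(g, E))   # Find_best_feature(E)
        E_f = [(x, c) for x, c in E if covers(f, x)]     # E' = examples covered by f
        if not E_f or quality(f, E) < min_gain:          # stop when gain is small
            break
        c = Counter(cls for _, cls in E_f).most_common(1)[0][0]  # most common class
        rule_list.append((f, c))                         # add (f, c) to RuleList
        E = [(x, cls) for x, cls in E if not covers(f, x)]       # E = E - E'
    return rule_list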
23. Problems of the greedy algorithm
- The interpretation of a rule depends on the preceding rules.
- Each iteration reduces the number of training examples.
- Poor rule choices at the beginning of the list can significantly reduce the accuracy of the learned DL.
- ⇒ Several papers propose alternative algorithms.
24. Summary of (Rivest, 1987)
- Gives a formal definition of DL.
- Shows the relations among k-DL, k-CNF, k-DNF and k-DT.
- Proves that k-DL is polynomially learnable.
- Gives a simple greedy algorithm to build a k-DL.
25. Outline
- Basic concepts and properties
- Case study
26. In practice
- Input attributes and the goal are not necessarily binary.
  - Ex: the previous word (which can take many values).
- A term → a feature (not necessarily a conjunction of literals).
  - Ex: the word appears in a k-word window.
- Only some feature types are considered, instead of all possible features.
  - Ex: previous word and next word.
- The greedy algorithm needs a quality measure for features.
  - Ex: pick the feature with minimum entropy.
27. Case study: accent restoration
- Task: restore accents in Spanish and French.
  - ⇒ A special case of word sense disambiguation (WSD).
- Ex: ambiguous de-accented forms
  - cesse → cesse, cessé
  - cote → côté, côte, cote, coté
- Algorithm: build a DL for each ambiguous de-accented form, e.g., one for cesse, another one for cote.
- Attributes: words within a window.
28. The algorithm
- Training:
  - Find the list of de-accented forms that are ambiguous.
  - For each ambiguous form, build a decision list.
- Testing: check each word in a sentence;
  - if it is ambiguous,
  - then restore the accented form according to the DL.
29. Step 1: Identify forms that are ambiguous
30. Step 2: Collecting training contexts
- Context: the previous three and the next three words.
- Strip the accents from the data. Why? Because the test input is de-accented, so training contexts must match the test conditions.
31. Step 3: Measure collocational distributions
- Feature types are pre-defined.
32. Collocations
33. Step 4: Rank decision rules by log-likelihood
- Rank each rule by the absolute log-likelihood ratio of the accented forms given the collocation (sketched below).
- There are many alternative ranking measures.
- Evidence may be a specific word or a word class.
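A hedged Python sketch of the ranking computation for a two-way ambiguity (the add-alpha smoothing is an assumption, not the paper's exact procedure):

import math

def log_likelihood(count1, count2, alpha=0.1):
    # |log P(form1 | collocation) / P(form2 | collocation)|, smoothed so
    # that zero counts do not produce infinite scores.
    p1 = (count1 + alpha) / (count1 + count2 + 2 * alpha)
    return abs(math.log(p1 / (1.0 - p1)))

# counts[colloc] = (occurrences with form1, occurrences with form2)
def rank_rules(counts):
    return sorted(counts, key=lambda c: log_likelihood(*counts[c]), reverse=True)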
34. Step 5: Pruning DLs
- Pruning by cross-validation.
- Remove redundant rules: e.g., if a WEEKDAY rule precedes the domingo rule and predicts the same form, the domingo rule is redundant.
35. Building the DL
- For a de-accented form w, find all possible accented forms.
- Collect training contexts (sketched below):
  - collect k words on each side of w;
  - strip the accents from the data.
- Measure collocational distributions:
  - use pre-defined attribute combinations,
  - e.g., −1w, +1w, +2w (the words at offsets −1, +1, +2 from w).
- Rank decision rules by log-likelihood.
- Optional: pruning and interpolation.
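A small Python sketch of the collocation extraction above (the feature names −1w, +1w, +2w follow the slide; the exact feature inventory and the example words are illustrative assumptions):

def collocations(left, right):
    # left/right: up to k=3 de-accented words before/after the ambiguous form.
    feats = []
    if left:
        feats.append(("-1w", left[-1]))        # word immediately to the left
    if right:
        feats.append(("+1w", right[0]))        # word immediately to the right
    if len(right) > 1:
        feats.append(("+2w", right[1]))        # second word to the right
    for w in left + right:
        feats.append(("window", w))            # word anywhere in the +/-3 window
    return feats

print(collocations(["va", "a", "la"], ["de", "la", "ciudad"]))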
36. Experiments
- Prior (baseline): choose the most common form.
37. Global probabilities vs. residual probabilities
- Two ways to calculate the log-likelihood:
  - Global probabilities: use the full data set.
  - Residual probabilities: use the residual training data (the examples not covered by preceding rules). More relevant, but less data and more expensive to compute.
- Interpolation: use both.
- In practice, global probabilities work better.
38. Combining vs. not combining evidence
- Here, each decision is based on a single piece of evidence (the first matching rule).
  - Run-time efficiency and easy modeling.
- It works well, at least for this task, but why?
  - Combining all available evidence rarely produces a different result.
  - The gross exaggeration of probability that comes from combining many non-independent log-likelihoods is avoided.
39. Summary of case study
- It allows a wider context (compared to n-gram methods).
- It allows the use of multiple, highly non-independent evidence types (compared to Bayesian methods).
- ⇒ A kitchen-sink approach of the best kind.
40. Advanced topics
41. Probabilistic DL
- DL: a rule is (f, v).
- Probabilistic DL: a rule is (f, v1/p1, v2/p2, …, vn/pn), where pi is the probability of class vi when f fires (sketched below).
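A minimal Python sketch of such a rule list (the representation is an assumption: each rule pairs a predicate with a class distribution):

prob_dl = [
    (lambda x: "A" in x and "B" in x, {"c1": 0.8, "c2": 0.2}),
    (lambda x: True,                  {"c1": 0.3, "c2": 0.7}),  # default rule
]

def classify(x):
    # The first rule that fires returns a distribution over classes
    # instead of a single label.
    for fires, dist in prob_dl:
        if fires(x):
            return dist

print(classify({"A", "B"}))   # -> {'c1': 0.8, 'c2': 0.2}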
42. Entropy of a feature q
- Split the training examples into those where q fired and those where it did not fire; the entropy of q is the weighted sum of the class entropies of the two subsets (formula below).
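One standard way to read the fired / not-fired split (an assumption: the slide's table is taken to define the usual weighted split entropy):

\[
  H(q) \;=\; P(\text{fired})\,H(C \mid \text{fired})
        \;+\; P(\text{not fired})\,H(C \mid \text{not fired}),
\]
\[
  \text{where}\quad H(C \mid s) \;=\; -\sum_{c} P(c \mid s)\,\log P(c \mid s).
\]

A greedy learner then prefers the feature q with minimum H(q).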
43. Algorithms for building DLs
- AQ algorithm (Michalski, 1969)
- CN2 algorithm (Clark and Niblett, 1989)
- Segal and Etzioni (1994)
- Goodman (2002)
- …
44. Summary of decision lists
- Rules are easily understood by humans (but remember that order matters).
- DLs tend to be relatively small, and are fast and easy to apply in practice.
- DL is related to DT, CNF, DNF, and TBL.
- Learning: the greedy algorithm and other improved algorithms.
- Extension: probabilistic DL.
  - Ex: if A ∧ B then (c1, 0.8), (c2, 0.2)