CSCI 548/B480: Introduction to Bioinformatics, Fall 2002 (presentation transcript)
1
CSCI 548/B480: Introduction to Bioinformatics
Fall 2002
Topic 5: Machine Intelligence - Learning and
Evolution
  • Dr. Jeffrey Huang, Assistant Professor
  • Department of Computer and Information Science,
    IUPUI
  • E-mail: huang@cs.iupui.edu

2
Machine Intelligence
  • Machine Learning
  • The subfield of AI concerned with intelligent
    systems that learn.
  • The computational study of algorithms that
    improve performance based on experience.
  • The attempt to build intelligent entities
  • We must understand intelligent entities first
  • Computational Brain
  • Mathematics
  • Philosophy staked out most of the ideas of AI, but
    making it a formal science required mathematical
    formalization in
  • Computation
  • Logic
  • Probability

3
Behavior-Based AI vs. Knowledge-Based AI
  • Definitions of Machine Learning
  • Reasoning
  • The effort to make computers think and solve
    problems
  • The study of mental faculties through the use of
    computational models
  • Behavior
  • Make machines perform human actions that require
    intelligence
  • Seeks to explain intelligent behavior in terms of
    computational processes
  • Agents

4
Operational Agents
  • Operational Views of Intelligence
  • The ability to perform intellectual tasks
  • Prove theorems, play chess, solve puzzles
  • Focus on what goes on between the ears
  • Emphasize the ability to build and effectively
    use mental models
  • The ability to perform intellectually challenging
    real world tasks
  • Medical diagnosis, tax advising, financial
    investing
  • Introduce new issues such as critical
    interactions with the world, model grounding,
    uncertainty
  • The ability to survive, adapt, and function in a
    constantly changing world
  • Autonomous agents
  • Vision, locomotion, and manipulation, many I/O
    issues
  • Self-assessment, learning, curiosity, etc.

5
Building Intelligent Artifacts
  • Symbolic Approaches
  • Construct goal-oriented symbol manipulation
    systems
  • Focus on high end abstract thinking
  • Non-symbolic approaches
  • Build performance-oriented systems
  • Focus on behavior
  • Need both in a tightly coupled form
  • Building such systems is difficult
  • Growing need to automate this process
  • Good approach: Evolutionary Algorithms

6
  • Behavior-Based AI
  • Behavior-Based AI vs. Knowledge-Based
  • "Situated" in environment
  • Multiple competencies ('routines')
  • Autonomy
  • Adaptation and Competition
  • Artificial Life (A-Life)
  • Agents: Reactive Behavior
  • Abstracting the logical principles of living
    organisms
  • Collective Behavior: Competition and Cooperation

7
Classification vs. Prediction
  • Classification
  • predicts categorical class labels
  • classifies data (constructs a model) based on the
    training set and the values (class labels) of a
    classifying attribute, and uses the model to
    classify new data
  • Prediction
  • models continuous-valued functions, i.e.,
    predicts unknown or missing values

8
Classification: A Two-Step Process
  • Model construction: describing a set of
    predetermined classes
  • Each tuple/sample is assumed to belong to a
    predefined class, as determined by the class
    label attribute
  • The set of tuples used for model construction is
    the training set
  • The model is represented as classification rules,
    decision trees, or mathematical formulae
  • Model usage: classifying future or unknown
    objects
  • Estimate the accuracy of the model (see the
    sketch below)
  • The known label of a test sample is compared with
    the classified result from the model
  • Accuracy rate is the percentage of test set
    samples that are correctly classified by the
    model
  • The test set is independent of the training set;
    otherwise over-fitting will occur
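A minimal sketch of this two-step process, using scikit-learn and a toy dataset (my choices, not part of the slides): construct a model on the training set, then estimate its accuracy rate on an independent test set.

# Sketch: construct a classifier on a training set, then estimate its
# accuracy on an independent test set (assumed toy data, scikit-learn).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

model = DecisionTreeClassifier()          # step 1: model construction
model.fit(X_train, y_train)

y_pred = model.predict(X_test)            # step 2: model usage
print("accuracy rate:", accuracy_score(y_test, y_pred))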

9
Classification Process
  • Model construction: a classification algorithm
    derives a model from the training data, e.g. the
    rule
    IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
  • Use the model in prediction: apply the model to a
    new tuple, e.g. (Jeff, Professor, 2) - Tenured?
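A tiny illustrative sketch (the function name and tuple encoding are assumptions) of applying the learned rule to the unseen tuple:

# Sketch: apply the learned classification rule to an unseen tuple.
def predict_tenured(rank, years):
    # IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

name, rank, years = ("Jeff", "Professor", 2)
print(name, "tenured?", predict_tenured(rank, years))  # -> yes (rank matches)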
10
Supervised vs. Unsupervised Learning
  • Supervised learning (classification)
  • Supervision: the training data (observations,
    measurements, etc.) are accompanied by labels
    indicating the class of the observations
  • New data is classified based on the training set
  • Unsupervised learning (clustering)
  • The class labels of the training data are unknown
  • Given a set of measurements, observations, etc.,
    the aim is to establish the existence of
    classes or clusters in the data

11
Classification and Prediction
  • Data Preparation
  • Data cleaning
  • Preprocess data in order to reduce noise and
    handle missing values
  • Relevance analysis (feature selection)
  • Remove the irrelevant or redundant attributes
  • Data transformation
  • Generalize and/or normalize data
  • Evaluating Classification Methods
  • Predictive accuracy
  • Speed and scalability
  • time to construct the model
  • time to use the model
  • Robustness: handling noise and missing values
  • Scalability: efficiency in disk-resident
    databases
  • Interpretability: understanding and insight
    provided by the model
  • Goodness of rules
  • decision tree size
  • compactness of classification rules

12
From Learning to Evolutionary Optimization
  • Accomplishing an abstract task = solving a problem =
  • searching through a space of potential
    solutions
  • finding the best solution
  • => an optimization process
  • Classical exhaustive methods?
  • Large space? => special machine learning techniques
  • Evolutionary Algorithms
  • Stochastic algorithms whose search methods model
    natural phenomena:
  • genetic inheritance
  • Darwinian strife for survival

13
  • "... the metaphor underlying genetic algorithms is
    that of natural evolution. In evolution, the
    problem each species faces is one of searching
    for beneficial adaptations to a complicated and
    changing environment. The knowledge that each
    species has gained is embodied in the makeup of
    the chromosomes of its members."
  • - L. Davis and M. Steenstrup, Genetic Algorithms
    and Simulated Annealing, pp. 1-11, Morgan
    Kaufmann, 1987

14
The Essential Components
  • A genetic representation for potential solutions to
    the problem
  • A way to create an initial population of
    potential solutions
  • An evaluation function that plays the role of the
    environment, rating solutions in terms of their
    fitness
  • i.e. the use of fitness to determine survival
    and reproductive rates
  • Genetic operators that alter the composition of
    children

15
Evolutionary Algorithm Search Procedure
16
Historical Background
  • Three paradigms emerged in the 1960s
  • Genetic Algorithms
  • Introduced by Holland (U. Michigan), extended by
    De Jong (GMU)
  • Envisioned for a broad range of adaptive systems
  • Evolution Strategies
  • Introduced by Rechenberg
  • Focused on real-valued parameter optimization
  • Evolutionary Programming
  • Introduced by Fogel and Koza
  • Applied to AI and machine learning problems
  • Today
  • Wide variety of evolutionary algorithms
  • Applied to many areas of science and engineering

17
Examples of Evolutionary AI
  • Parameter Tuning
  • Pervasiveness of parameterized models
  • Complex behavioral changes due to non-linear
    interactions
  • Example
  • Weights of an artificial neural network
  • Parameters of a heuristic evaluation function
  • Parameters of a rule induction system
  • Parameters of membership functions
  • Goal: evolve over time a useful set of discrete/
    continuous parameters

18
  • Evolving Structure
  • Effect behavioral change via more complex
    structures
  • Example
  • Selecting/constructing the topology of ANNs
  • Selecting/constructing the feature sets
  • Selecting/constructing plans/scenarios
  • Selecting/constructing membership functions
  • Goal: evolve useful structures over time
  • Evolving Programs
  • Goal: acquire new behaviors and adapt existing
    ones
  • Example
  • Acquire/adapt behavioral rule sets
  • Acquire/adapt arm/joint control programs
  • Acquire/adapt task-oriented programming code

19
How Does a Genetic Algorithm Work?
  • A simple example of function optimization
  • Find max f(x) = x^2, for x in [0, 4]
  • Representation
  • Genotype (chromosome): internally, points in the
    search space are represented as (binary) strings
    over some alphabet
  • Phenotype: the expressed traits of an individual
  • With a precision of about 4x10^-4 for x in [0, 4]
    (roughly 10,000 points), 14 bits are needed:
    2^13 = 8,192 < 10,000 < 16,384 = 2^14
  • Simple fixed-length binary encoding
  • The string 00 0000 0000 0000 is assigned 0.0
  • The string 00 0000 0000 0001 is assigned
    0.0 + bin2dec(binary string) x 4/(2^14 - 1),
    and so on
  • Phenotype 4.0 corresponds to genotype
    11 1111 1111 1111 (see the decoding sketch below)
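A minimal decoding sketch for this representation (pure Python, not from the slides): map a 14-bit genotype to its phenotype x in [0, 4] and evaluate f(x) = x^2.

# Sketch: decode a 14-bit genotype into a phenotype x in [0, 4]
# and evaluate the fitness f(x) = x^2.
N_BITS = 14

def decode(genotype):
    """Map a binary string to a real value in [0, 4]."""
    return int(genotype, 2) * 4.0 / (2**N_BITS - 1)

def fitness(genotype):
    x = decode(genotype)
    return x * x

print(decode("0" * 14))             # 0.0
print(decode("1" * 14))             # 4.0
print(fitness("11111111111111"))    # 16.0, reached at x = 4.0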

20
genotype 00000000000000 -> phenotype 0.0
genotype 00000000000001 -> phenotype 4/(2^14 - 1)
genotype 11111111111111 -> phenotype 4.0
  • Initial population
  • Create a population (pop_size) of chromosomes,
    where each chromosome is a binary vector of 14
    bits
  • All 14 bits for each chromosome are initialized
    randomly
  • Evaluation function
  • Evaluation function eval for binary vectors v is
    equal to the function f
  • eval(v) = f(x)
  • e.g. eval(v1) = f(x1) = fitness_1

21
  • Parameters
  • pop_size = 24,
  • Prob. of crossover (Xover), pc = 0.6,
  • Prob. of mutation, pm = 0.01
  • Recombination using genetic operators (see the
    sketch below)
  • Crossover (pc)
  • v1 = 01111100010011 -> v1' = 01110101011100
  • v2 = 00010101011100 -> v2' = 00011100010011
  • Mutation (pm)
  • v2' = 00011100010011 -> v2'' = 00011110010011
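A sketch of one-point crossover and bit-flip mutation (pure Python; the cut point and mutated position are fixed here so the output reproduces the example above, whereas a real GA would draw them at random with probabilities pc and pm):

# Sketch: one-point crossover and bit-flip mutation on binary chromosomes.
def crossover(v1, v2, point):
    """Swap the tails of two parent strings after the cut point."""
    return v1[:point] + v2[point:], v2[:point] + v1[point:]

def mutate(v, position):
    """Flip a single bit of the chromosome."""
    flipped = "1" if v[position] == "0" else "0"
    return v[:position] + flipped + v[position + 1:]

c1, c2 = crossover("01111100010011", "00010101011100", point=4)
print(c1, c2)          # 01110101011100 00011100010011
print(mutate(c2, 6))   # 00011110010011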

22
  • Selection of M(t) from M(t-1) using the roulette
    wheel (see the sketch below)
  • Total fitness of the population:
    F = sum of fitness_i, i = 1..pop_size
  • Probability of selection prob_i for each
    chromosome v_i: prob_i = fitness_i / F
  • Cumulative probability: q_i = prob_1 + ... + prob_i
  • Generate random numbers r_j from [0, 1], where j =
    1..pop_size
  • Select chromosome v_i such that q_(i-1) < r_j <= q_i
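A minimal roulette-wheel selection sketch following the steps above (pure Python; the toy population and fitness values are assumptions):

# Sketch: roulette-wheel (fitness-proportional) selection.
import bisect
import random

def roulette_select(population, fitnesses):
    """Select len(population) chromosomes, each with probability
    proportional to its fitness."""
    total = sum(fitnesses)                        # total fitness F
    probs = [f / total for f in fitnesses]        # prob_i = fitness_i / F
    q, running = [], 0.0
    for p in probs:                               # cumulative probabilities q_i
        running += p
        q.append(running)
    selected = []
    for _ in population:                          # one wheel spin per slot
        r = random.random()
        i = min(bisect.bisect_left(q, r), len(q) - 1)
        selected.append(population[i])
    return selected

pop = ["01110101011100", "00011100010011", "11111111111111"]
fit = [3.4, 1.2, 16.0]                            # assumed toy fitness values
print(roulette_select(pop, fit))                  # fitter strings recur more often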

23
(No Transcript)
24
Homing to the Optimal Solution
25
Best-so-far Curve
26
Optimal Feature Subset
  • Search for subsets of discriminatory features
  • A combinatorial optimization problem
  • Two general approaches to identifying optimal
    subsets of features
  • Abstract measurement of important properties of
    good feature sets
  • Orthogonality (e.g. PCA), information content, low
    variance
  • Less expensive process
  • Falls into suboptimal performance if the abstract
    measures do not correlate well with actual
    performance
  • Building a classifier from the feature subset and
    evaluating its performance on actual
    classification tasks (see the sketch below)
  • Better classification performance
  • The cost of building and testing classifiers
    prohibits any kind of systematic evaluation of
    feature subsets
  • Suboptimal in practice: large numbers of
    candidate features cannot be handled by any form
    of systematic search
  • 2^N possible candidate subsets of N features
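A minimal "wrapper" sketch of the second approach (scikit-learn and a toy dataset are my assumptions): a candidate feature subset is encoded as a bit mask, the kind of chromosome a GA could evolve, and scored by the cross-validated accuracy of a classifier restricted to the selected features.

# Sketch: evaluate a candidate feature subset (a bit mask over the
# N features) by the accuracy of a classifier trained on it.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)           # assumed toy data, N = 4 features

def subset_fitness(mask):
    """mask: string like '1010' selecting a subset of the N features."""
    cols = [i for i, bit in enumerate(mask) if bit == "1"]
    if not cols:
        return 0.0                           # empty subset gets zero fitness
    scores = cross_val_score(DecisionTreeClassifier(), X[:, cols], y, cv=5)
    return scores.mean()

print(subset_fitness("1010"), subset_fitness("1111"))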

27
Inductive Learning
  • Learning From Examples
  • Decision Tree (DT)
  • Information Theory (IT)
  • Question: what are the BEST attributes
    (features) for building the decision tree?
  • Answer: the BEST attribute is the one that is
    MOST informative, i.e. for which
    ambiguity/uncertainty is least
  • Solution: measure information content using
    the expected amount of information provided by
    the attribute

28
Classification by Decision Tree Induction
  • Decision tree
  • A flow-chart-like tree structure
  • Internal node denotes a test on an attribute
  • Branch represents an outcome of the test
  • Leaf nodes represent class labels or class
  • distribution
  • Decision tree generation consists of two phases
  • Tree construction
  • At start, all the training examples are at the
    root
  • Partition examples recursively based on selected
    attributes
  • Tree pruning
  • Identify and remove branches that reflect noise
    or outliers
  • Use of decision tree: classifying an unknown
    sample
  • Test the attribute values of the sample against
    the decision tree

Exs. Class Size Color Surface
1 A Small Yellow Smooth
2 A Medium Red Smooth
3 A Medium Red Smooth
4 A Big Red Rough
5 B Medium Yellow Smooth
6 B Medium Yellow Smooth
29
  • Entropy
  • Define an entropy function H such that
    H = - sum_i p_i log2(p_i),
    where p_i is the probability associated with the
    ith class
  • For a feature, the entropy is calculated for each
    value.
  • The sum of the entropy weighted by the
    probability of each value is the entropy for that
    feature
  • Example: toss a fair coin
    H = -(0.5 log2 0.5 + 0.5 log2 0.5) = 1 bit
  • If the coin is not fair, e.g. P(heads) = 0.99,
    then H = -(0.99 log2 0.99 + 0.01 log2 0.01) ~ 0.08
  • So, by tossing the coin you get very little
    (extra) information (that you didn't expect)
    (see the sketch below)
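A small sketch computing these entropies (pure Python; only the coin example is taken from the slide):

# Sketch: entropy of a discrete distribution, applied to the coin example.
from math import log2

def entropy(probs):
    """H = -sum_i p_i * log2(p_i), ignoring zero-probability classes."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))     # fair coin   -> 1.0 bit
print(entropy([0.99, 0.01]))   # biased coin -> ~0.08 bit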

30
  • In general, if you have p positive examples and
    n negative examples,
    H(p/(p+n), n/(p+n)) = -(p/(p+n)) log2(p/(p+n))
                          - (n/(p+n)) log2(n/(p+n))
  • For p = n => H = 1
  • i.e. originally there is the most uncertainty
    about the eventual outcome (picking up an example)
    and the most to gain by picking the example.

31
Decision Tree Induction
  • Basic algorithm (a greedy algorithm)
  • Tree is constructed in a top-down recursive
    divide-and-conquer manner
  • At start, all the training examples are at the
    root
  • Attributes are categorical (if continuous-valued,
    they are discretized in advance)
  • Examples are partitioned recursively based on
    selected attributes
  • Test attributes are selected on the basis of a
    heuristic or statistical measure (e.g.,
    information gain)
  • Conditions for stopping partitioning
  • All samples for a given node belong to the same
    class
  • There are no remaining attributes for further
    partitioning
  • Majority voting is employed for classifying the
    leaf
  • There are no samples left
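A compact recursive sketch of this greedy, top-down induction (pure Python, not the slides' exact algorithm; the data is the Size/Color/Surface example table shown earlier):

# Sketch: greedy top-down decision-tree induction with information gain.
from math import log2
from collections import Counter

def entropy(labels):
    counts, total = Counter(labels), len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def gain(examples, attr):
    """Information gain = H(parent) - weighted sum of child entropies."""
    labels = [ex["Class"] for ex in examples]
    remainder = 0.0
    for value in {ex[attr] for ex in examples}:
        subset = [ex["Class"] for ex in examples if ex[attr] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - remainder

def build_tree(examples, attrs):
    labels = [ex["Class"] for ex in examples]
    if len(set(labels)) == 1:                  # all samples in one class
        return labels[0]
    if not attrs:                              # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(examples, a))
    branches = {}
    for value in {ex[best] for ex in examples}:
        subset = [ex for ex in examples if ex[best] == value]
        branches[value] = build_tree(subset, [a for a in attrs if a != best])
    return {best: branches}

data = [                                       # the six examples from the table
    {"Class": "A", "Size": "Small",  "Color": "Yellow", "Surface": "Smooth"},
    {"Class": "A", "Size": "Medium", "Color": "Red",    "Surface": "Smooth"},
    {"Class": "A", "Size": "Medium", "Color": "Red",    "Surface": "Smooth"},
    {"Class": "A", "Size": "Big",    "Color": "Red",    "Surface": "Rough"},
    {"Class": "B", "Size": "Medium", "Color": "Yellow", "Surface": "Smooth"},
    {"Class": "B", "Size": "Medium", "Color": "Yellow", "Surface": "Smooth"},
]
print(build_tree(data, ["Size", "Color", "Surface"]))
# e.g. {'Color': {'Red': 'A', 'Yellow': {'Size': {'Small': 'A', 'Medium': 'B'}}}}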

32
Algorithm
  • Select a random subset W (called the window) from
    the training set T
  • Build a DT for the current W
  • Select the best feature which minimizes the
    entropy H (or max. gain)
  • Categorize training instances (examples) into
    subsets by this feature
  • Repeat this process recursively until each subset
    contains instances of one kind (class) or some
    statistical criterion is satisfied
  • Scan the entire training set for exceptions to
    the DT
  • If exceptions are found, insert some of them into
    W and repeat from step 2
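A rough sketch of this windowing loop (scikit-learn's DecisionTreeClassifier stands in for "build a DT"; the dataset, window size, and number of exceptions inserted per pass are assumptions):

# Sketch: windowing - grow a tree on a small window W, scan the full
# training set T for exceptions, add some of them to W, and repeat.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)                   # the training set T
rng = np.random.default_rng(0)
window = rng.choice(len(X), size=20, replace=False).tolist()  # random subset W

while True:
    tree = DecisionTreeClassifier().fit(X[window], y[window])  # DT for current W
    wrong = np.flatnonzero(tree.predict(X) != y)                # exceptions in T
    exceptions = [i for i in wrong if i not in window]
    if not exceptions:                              # no new exceptions: done
        break
    window += exceptions[:10]                       # insert some exceptions into W
print("final window size:", len(window))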

33
  • Information Gain
  • The information gain from the test on attribute A
    is defined as the difference between the original
    information requirement and the new requirement:
    Gain(A) = H(p/(p+n), n/(p+n)) - Remainder(A),
    where Remainder(A) = sum_i (p_i + n_i)/(p + n) x
    H(p_i/(p_i + n_i), n_i/(p_i + n_i))
  • Note that Remainder(A) is a weighted (by
    attribute values) entropy function
  • Maximizing Gain(A) <=> minimizing Remainder(A);
    then A is the most informative attribute
    (question)

34
The ID3 Algorithm and Quinlan's C4.5
  • C4.5
  • Tutorial: http://yoda.cis.temple.edu:8080/UGAIWWW/lectures/C45/
  • Matlab program: http://www.cs.wisc.edu/olvi/uwmp/msmt.html
  • See5 / C5.0
  • Tutorial: http://borba.ncc.up.pt/niaad/Software/c50/c50manual.html
  • Software for Win2000: http://www.rulequest.com/download.html

35
Exs. Class Size Color Surface
1 A Small Yellow Smooth
2 A Medium Red Smooth
3 A Medium Red Smooth
4 A Big Red Rough
5 B Medium Yellow Smooth
6 B Medium Yellow Smooth
  • Example (worked out in the sketch below)
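A worked computation of the information gain of each attribute for the six examples above (pure Python; the counts are read off the table, and the gain values are computed here rather than quoted from the slides):

# Sketch: information gain of Size, Color, and Surface for the six
# examples above (4 in class A, 2 in class B).
from math import log2

def H(p, n):
    """Entropy of a two-class split with p positive and n negative examples."""
    if p == 0 or n == 0:
        return 0.0
    pp, pn = p / (p + n), n / (p + n)
    return -(pp * log2(pp) + pn * log2(pn))

# (class A count, class B count) per attribute value, read off the table
splits = {
    "Size":    [(1, 0), (2, 2), (1, 0)],   # Small, Medium, Big
    "Color":   [(1, 2), (3, 0)],           # Yellow, Red
    "Surface": [(3, 2), (1, 0)],           # Smooth, Rough
}

total = H(4, 2)                            # entropy before any split
for attr, counts in splits.items():
    remainder = sum((a + b) / 6 * H(a, b) for a, b in counts)
    print(attr, "gain =", round(total - remainder, 3))
# Color has the largest gain, so it is chosen as the root test.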

36
  • Noise and Overfitting
  • Question: what about two or more examples with
    the same description but different
    classifications?
  • Answer: each leaf node reports either the MAJORITY
    classification or relative frequencies
  • Question: what about irrelevant attributes (noise
    and overfitting)?
  • Answer: tree pruning
  • Solution: an information gain close to zero is a
    good clue to irrelevance; compare the actual
    numbers of positive (+) and negative (-) examples
    in each subset i, p_i and n_i, with the expected
    numbers p^_i and n^_i assuming true irrelevance:
    p^_i = p x (p_i + n_i)/(p + n),
    n^_i = n x (p_i + n_i)/(p + n),
    where p and n are the total numbers of positive
    and negative examples to start with.
  • Total deviation (for statistical significance):
    D = sum_i [ (p_i - p^_i)^2 / p^_i
              + (n_i - n^_i)^2 / n^_i ]
  • Under the null hypothesis of irrelevance, D follows
    a chi-squared distribution (see the sketch below)
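A small sketch of this chi-squared relevance test (Python with scipy for the critical value; the counts are invented for illustration):

# Sketch: chi-squared test for attribute irrelevance. p_i, n_i are the
# observed positive/negative counts in each subset produced by the split.
from scipy.stats import chi2

p, n = 6, 6                                  # assumed totals
subsets = [(2, 2), (1, 3), (3, 1)]           # assumed (p_i, n_i) per value

D = 0.0
for p_i, n_i in subsets:
    expected_p = p * (p_i + n_i) / (p + n)   # p^_i under irrelevance
    expected_n = n * (p_i + n_i) / (p + n)   # n^_i under irrelevance
    D += ((p_i - expected_p) ** 2 / expected_p
          + (n_i - expected_n) ** 2 / expected_n)

dof = len(subsets) - 1                       # degrees of freedom
print("D =", D, " critical value:", chi2.ppf(0.95, dof))
# If D is below the critical value, the attribute looks irrelevant and
# the split is a candidate for pruning.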

37
Extracting Classification Rules from Trees
  • Represent the knowledge in the form of IF-THEN
    rules
  • One rule is created for each path from the root
    to a leaf
  • Each attribute-value pair along a path forms a
    conjunction
  • The leaf node holds the class prediction
  • Rules are easier for humans to understand
  • Example
  • IF age < 30 AND student = 'no' THEN
    buys_computer = 'no'
  • IF age < 30 AND student = 'yes' THEN
    buys_computer = 'yes'
  • IF age = 31..40 THEN buys_computer = 'yes'
  • IF age > 40 AND credit_rating = 'excellent'
    THEN buys_computer = 'yes'
  • IF age > 40 AND credit_rating = 'fair' THEN
    buys_computer = 'no'
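A minimal sketch of extracting one rule per root-to-leaf path (pure Python; the nested-dictionary tree encoding is an assumption, with leaves matching the buys_computer rules above):

# Sketch: walk every root-to-leaf path of a tree and print an IF-THEN rule.
# Internal nodes are dicts {attribute: {value: subtree}}; leaves are labels.
tree = {"age": {
    "<30":    {"student": {"no": "no", "yes": "yes"}},
    "31..40": "yes",
    ">40":    {"credit_rating": {"excellent": "yes", "fair": "no"}},
}}

def extract_rules(node, conditions=()):
    if not isinstance(node, dict):                 # leaf: emit one rule
        print("IF", " AND ".join(conditions) or "TRUE",
              "THEN buys_computer =", node)
        return
    (attr, branches), = node.items()
    for value, subtree in branches.items():        # one conjunct per branch
        extract_rules(subtree, conditions + (f"{attr} = {value}",))

extract_rules(tree)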

38
Decision Tree
  • Avoid Overfitting in Classification
  • The generated tree may overfit the training data
  • Too many branches, some may reflect anomalies due
    to noise or outliers
  • The result is poor accuracy for unseen samples
  • Two approaches to avoid overfitting
  • Prepruning: halt tree construction early - do not
    split a node if this would result in the goodness
    measure falling below a threshold
  • Difficult to choose an appropriate threshold
  • Postpruning: remove branches from a fully grown
    tree to get a sequence of progressively pruned trees
  • Use a set of data different from the training
    data to decide which is the best pruned tree

39
  • Approaches to Determine the Final Tree Size
  • Separate training (2/3) and testing (1/3) sets
  • Use cross validation, e.g., 10-fold cross
    validation
  • Use all the data for training
  • but apply a statistical test (e.g., chi-square)
    to estimate whether expanding or pruning a node
    may improve the entire distribution
  • Use minimum description length (MDL) principle
  • halting growth of the tree when the encoding is
    minimized
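A brief sketch of the cross-validation option above (scikit-learn; the dataset and the use of max_depth as the tree-size knob are assumptions): candidate tree sizes are compared by 10-fold cross-validated accuracy.

# Sketch: use 10-fold cross-validation to pick a tree size
# (here controlled by max_depth) instead of trusting training accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
for depth in (1, 2, 3, 5, None):             # None = fully grown tree
    scores = cross_val_score(DecisionTreeClassifier(max_depth=depth),
                             X, y, cv=10)
    print("max_depth =", depth, "mean CV accuracy =", round(scores.mean(), 3))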

40
Decision Tree
  • Enhancements to basic decision tree induction
  • Allow for continuous-valued attributes
  • Dynamically define new discrete-valued attributes
    that partition the continuous attribute value
    into a discrete set of intervals
  • Handle missing attribute values
  • Assign the most common value of the attribute
  • Assign probability to each of the possible values
  • Attribute construction
  • Create new attributes based on existing ones that
    are sparsely represented
  • This reduces fragmentation, repetition, and
    replication