Title: Skillslash
1. Outline
- Machine Learning: An Overview
- Machine Learning Techniques
- Decision Trees
- Random Forests
- Applications
- Comparative Analysis of Tools
2. What is machine learning?
- "Field of study that gives computers the ability to learn without being explicitly programmed." - Arthur Samuel, 1959
- Tom Mitchell, in his famous book, defines machine learning as improving performance at a task with experience.
Common application points:
- Product recommendation
- Spam detection
- Web search
3. Why machine learning?
- Every 2 days we create as much information as we did from the beginning of time until 2003. (Source)
- Over 90% of all the data in the world was created in the past 2 years. (Source)
- It is expected that by 2020 the amount of digital information in existence will have grown from 3.2 zettabytes today to 40 zettabytes. (Source)
- Even if we had the human resources to handle this splurge of data, the issue of rising complexity would remain.
- With rising complexity, knowledge discovery is getting tricky.
- Machine learning comes to our aid here by automating this entire process.
- It adapts to new information, which reduces the need for explicit model re-adjustment.
4. Why machine learning? (contd.)
Number one on the FBI's most wanted list escapes prison! What do you do? You write a program to identify this man's image from security camera footage across the seven continents.
But what happens when number 2 on that list escapes? Do you put in the same kind of effort again?
Of course not! This is where machine learning works its magic. We simply train the computer to perform the task of facial recognition and it automatically extrapolates to new faces. Wonderful, isn't it?
5. Broad Classifications
Supervised Learning
- Known output values
- Two basic types
  - Classification: discrete output labels (e.g. risk analysis)
  - Regression: continuous output values (e.g. pricing in lemon markets)
- A few techniques used: random forests, decision trees, neural networks, Bayesian classification
Unsupervised Learning
- Unknown output values
- Minimal human expertise available, since there is no way to distinguish between dependent and independent variables
- A few techniques used: clustering, association rule analysis, hierarchical clustering
Reinforcement Learning
- Less commonly used
- The computer learns on the basis of incentives
- Incentives are generally provided in the form of rewards and punishments
- The learner tries to find the optimal outcome by comparing the rewards gained across all of them
6. Machine Learning vs Statistical Modelling
Machine Learning
- Focus lies primarily on the prediction outcome and accuracy
- Adapts in real time based on new data
- Small datasets with many attributes can be handled
Statistical Modelling
- Focus lies primarily on understanding how factors affect the outcome
- New data requires further manual specification
- Certain high-dimensional cases cannot be handled (number of attributes greater than number of data points)
In spite of the minor differences, the two ideologies are fairly similar:
- Both of them use standard statistical tests (F-tests, t-tests)
- Machine learning is a relatively newer field and we still have a lot to learn here
7. Decision Trees: An Introduction
- Decision trees are one of the most popular machine learning techniques used in descriptive analysis
- They are fairly simple and easy to use
- Building a tree is an inductive learning task: it uses particular facts to make more generalized conclusions
- A tree provides us with a set of rules that ultimately govern the final decision
- It is a predictive model based on a branching series of Boolean tests
- It classifies at multiple levels, unlike some other classification techniques
8. Types of Decision Trees
Two broad classifications:
- Classification trees
  - The output label comprises discrete values
  - These values can be entered directly into the tree's branching tests
  - Impurity measures are commonly used as the splitting criteria, e.g. the Gini index and information gain
- Regression trees
  - The output label takes continuous real values
  - These values have to be discretized before being entered into the tree
  - Mean squared error is used as the basis for splitting
  - Commonly returns the mean value of the class as the final output label
9. Decision Trees
Objective: What do I do this weekend? For something that sounds this simple, it actually isn't.
(Figure: a decision tree with root node "Parents visiting?". Yes leads to Cinema; No leads to a Weather node. Sunny leads to Play Tennis, Rainy leads to Stay In, and Windy leads to a Money node, where Rich leads to Shopping and Poor leads to Cinema. Cinema, Play Tennis, Stay In and Shopping are terminal decision nodes.)
Result: The decision tree provides us with a rule, e.g. you go to play tennis when your parents are not visiting and when it's sunny outside.
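To make the rule-extraction idea concrete, here is a minimal sketch in R using the rpart package (an implementation of CART, discussed later). The toy data frame is invented purely to mirror the tree in the figure, so the exact rows are an assumption.

# Minimal sketch: a toy "what do I do this weekend?" tree with rpart.
# The data below is invented to mirror the figure on this slide.
library(rpart)

weekend <- data.frame(
  parents  = c("yes", "yes", "no", "no", "no", "no", "no", "no", "no", "no"),
  weather  = c("sunny", "rainy", "sunny", "sunny", "rainy", "windy", "windy", "sunny", "rainy", "windy"),
  money    = c("rich", "rich", "rich", "poor", "poor", "rich", "poor", "rich", "rich", "poor"),
  decision = c("cinema", "cinema", "tennis", "tennis", "stay in", "shopping", "cinema", "tennis", "stay in", "cinema"),
  stringsAsFactors = TRUE
)

# The defaults (minsplit = 20) are relaxed because this toy data set is tiny.
fit <- rpart(decision ~ parents + weather + money, data = weekend,
             method = "class", control = rpart.control(minsplit = 2, cp = 0))

print(fit)  # prints the branching rules, e.g. parents = no & weather = sunny -> tennis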
10. How to split?
- There are multiple ways of deciding which attribute to choose above the others. Impurity measures are used most frequently, and three of them are listed below (a small sketch in R follows at the end of this slide).
- Information gain
  - Uses entropy as the measure of impurity
  - Used in the ID3 algorithm (see appendix A1.4)
- Gini index
  - Measures the divergence between the probability distributions of the target attribute's values
  - Used in the popular CART models
- Gain ratio
  - Normalizes the information gain criterion
  - Used in the C4.5 algorithm (an evolution of the ID3 process)
Impurity refers to the dissimilarity of the elements within a node.
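For concreteness, here is a small illustration in R of the impurity measures named above; the helper-function names are my own, not from any package.

# Entropy and Gini impurity for a vector of class labels, plus the
# information gain obtained from splitting on a grouping variable.
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]                    # drop empty classes so 0 * log(0) never occurs
  -sum(p * log2(p))
}

gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

info_gain <- function(labels, groups) {
  weights <- table(groups) / length(groups)
  child_entropy <- sapply(split(labels, groups), entropy)
  entropy(labels) - sum(weights * child_entropy)
}

# Toy usage: how much does knowing the weather tell us about the decision?
decision <- c("tennis", "tennis", "cinema", "stay in", "cinema")
weather  <- c("sunny", "sunny", "rainy", "rainy", "windy")
info_gain(decision, weather)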
11. When to stop splitting?
- One of the following generally happens to trigger the stopping criterion:
  - All instances point to a single output label
  - The maximum tree depth is reached
  - The number of cases in the terminal node is less than the minimum number of cases specified for a parent node
  - Upon splitting, the number of cases in a child node would be less than the minimum number specified for child nodes
  - The best splitting criterion is below a specified threshold
- These criteria map directly onto tuning parameters in common implementations; see the sketch below.
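As a concrete illustration (an assumption, since the slides do not name a specific implementation), these stopping criteria correspond to control parameters in R's rpart package:

# Stopping criteria expressed as rpart control parameters.
library(rpart)

ctrl <- rpart.control(
  maxdepth  = 5,     # the maximum tree depth is reached
  minsplit  = 20,    # minimum number of cases a (parent) node must have to be split
  minbucket = 7,     # minimum number of cases allowed in a child / terminal node
  cp        = 0.01   # a split must improve the overall fit by at least cp, or splitting stops
)

# fit <- rpart(label ~ ., data = training_data, method = "class", control = ctrl)
# ("label" and "training_data" above are placeholders, not real objects.)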
12. Popular Decision Tree Inducers
- ID3
  - Splitting criterion: information gain
- C4.5
  - Splitting criterion: gain ratio
- CART
  - Both classification and regression are possible
  - Splitting criterion: Gini index (classification), mean squared error (regression)
  - Strictly binary trees (i.e. each parent node splits into exactly two child nodes)
- CHAID
  - p-values are used to check for significant differences in the target attribute
  - Continuous target attribute: F-test is used
  - Nominal target attribute: Pearson chi-squared test is used
  - Ordinal target attribute: likelihood-ratio test is used
  - A threshold is set for the p-value beyond which splitting stops
13. Over-fitting
- A modeling error which occurs when a function fits a limited set of data points too closely. This usually happens when the model becomes over-complex and tries to explain even the noise in the training data. It is one of the major problems faced in the construction of decision trees.
- Consider the error of a hypothesis h over
  - the training data: error_train(h)
  - the entire distribution D of the data: error_D(h)
- Hypothesis h over-fits the training data if there exists an alternative hypothesis h' such that both of the conditions below hold:
  error_train(h) < error_train(h')  and  error_D(h) > error_D(h')
14. How do we tackle this problem?
The answer lies in the method of pruning.
- A tree of the maximum possible depth is first created, and then we start hacking off branches from the terminal nodes and work our way up. This methodology is called post-pruning.
- The obvious question that arises next is: why grow a tree and then prune it, rather than simply not growing it beyond a certain point?
- Fixing a threshold that requires a certain reduction in misclassification error for every split may prevent useful subsequent splits from happening altogether, which in turn would hurt the predictive capabilities of the model.
15. To have a clearer picture, we need to acquaint ourselves with a few mathematical measures. Let R(T) represent the misclassification error of a tree T and |T| the number of leaves in the tree. We now define a cost-complexity measure R_a(T), where a is the complexity parameter (i.e. as the number of leaves increases, so does this measure):

R_a(T) = R(T) + a|T|

The principle behind the pruning process is to reduce the value of this cost-complexity measure, which ensures that the tree is simple while at the same time sufficiently accurate and generalizable. Even from a business perspective, high depth is not favorable for any analysis, owing to growing complexity.
16. The pruning sequence:
- T1: the tree of maximum depth
- T2: hack off the branches of the lowest node in T1 to get this
- T3: hack off the lowest node from T2 to get this
Now that all the subtrees have been obtained, we pass the test data through these trees to obtain the one with the lowest cost-complexity measure. That is the best and final tree.
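A minimal sketch of this post-pruning workflow in R with rpart, whose complexity parameter cp plays the role of a in R_a(T) = R(T) + a|T|. Note that rpart scores the subtree sequence with cross-validated error rather than a separate test set, but the idea is the same.

# Grow a deliberately deep tree on rpart's built-in kyphosis data, then prune it.
library(rpart)

set.seed(1)
full_tree <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                   method = "class",
                   control = rpart.control(cp = 0, minsplit = 2))

printcp(full_tree)   # the sequence of subtrees with their cross-validated errors

# Keep the subtree whose cross-validated error is lowest.
best_cp <- full_tree$cptable[which.min(full_tree$cptable[, "xerror"]), "CP"]
pruned_tree <- prune(full_tree, cp = best_cp)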
17. Telecom Example
Objective: understand consumer behavior in order to create customer-specific promotional offers.
(Figure: decision tree for mobile customers.)
Result: the company should incentivize a good customer to purchase more connections, to transform him into a great customer.
18. Decision Trees vs Logistic Regression
(Figure: two classifiers on the same data. The logistic regression classifier uses one oblique boundary to split; the decision tree classifier uses two distinct boundaries.)
- Both of them are fairly fast, and there is no distinguishing them here.
- Logistic regression will work better if there is a single decision boundary, especially if it is oblique to the axes.
- Decision trees work better on more complex datasets which do not have one fixed underlying decision boundary but several such boundaries.
- Higher scalability makes decision trees more prone to over-fitting.
19. Decision Trees
- Advantages
  - Easy to use
  - Starts returning valuable information really fast
  - High scalability
- Limitations
  - Highly sensitive to new information
  - Susceptible to over-fitting
  - Low accuracy
- How do we get past these limitations?
20. Ensemble Methods
- Ensemble learning is a machine learning paradigm where we employ several individual or "base" learners to work on the same problem.
- Individual learners are classified on the basis of predictive accuracy as strong and weak learners.
- We then use some aggregation methodology to improve upon the individual learners.
- Popular ensemble techniques include bagging and boosting.
21. Bagging
- One of the popular ensemble methods
- Stands for Bootstrap Aggregation
Steps outlining the process:
1. Random sampling with replacement (producing train sample 1, train sample 2, train sample 3, ...)
2. Implementing a learning algorithm on all sub-samples
3. Aggregating the predictions of the individual classifiers
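The three steps can be sketched in a few lines of R, using rpart trees as the base learners and the built-in iris data as a stand-in; this is an illustration of the idea rather than a production bagging routine.

# A minimal sketch of bagging with rpart trees on the iris data.
library(rpart)

set.seed(42)
n_bags <- 25
n <- nrow(iris)

# Steps 1 and 2: draw bootstrap samples and fit a tree on each.
trees <- lapply(seq_len(n_bags), function(b) {
  idx <- sample(n, n, replace = TRUE)            # random sampling with replacement
  rpart(Species ~ ., data = iris[idx, ], method = "class")
})

# Step 3: aggregate the individual predictions by majority vote.
votes <- sapply(trees, function(t) as.character(predict(t, iris, type = "class")))
bagged_pred <- apply(votes, 1, function(row) names(which.max(table(row))))

mean(bagged_pred == iris$Species)  # training accuracy of the bagged ensemble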
22. Random Forests
- A random forest is an ensemble classifier that consists of many decision trees and outputs the class that is the mode of the classes output by the individual trees.
- Purpose
  - Improve prediction accuracy above that of single decision trees
- Principle
  - Increase diversity among the individual decision trees; we ideally want the trees to be uncorrelated
- Solution
  - Randomness in sampling the training data, ensured by bootstrapping
  - Random selection of the features to be used in each tree
  - The number of features used per tree is a small fraction of the total number; by default, the square root of the total number of features is used
23. Algorithm
This is a three-step, fully parallelizable process:
1. Bootstrapping
2. Forest generation
3. Aggregation
24. Algorithm Flowchart
(Flowchart: begin; for each tree, sample a training data subset and choose a variable subset; for each chosen variable, sort by the variable and compute the Gini index at each split point; choose the best split and build the next split; repeat until the stop condition holds at each node; then calculate the prediction error; end.)
25. How do Random Forests work?
- The forest has been generated, and we seek to understand how a prediction is made on some new input vector.
- The input vector is pushed through all the trees in the forest.
- It ends up in one of the several terminal nodes in each of these trees.
- Now we need to decide what each of those terminal nodes represents in terms of its output label.
  - Case 1: the terminal node is pure (i.e. it contains data points with all-similar output labels). In this case the choice is simple, as there is only one output label to choose from, and we choose that.
  - Case 2: the terminal node is impure. In such cases the mode output label is chosen as the predicted output for all elements in that terminal node.
- Finally, the mode of all the predictions made by the individual trees is chosen as the final prediction of the random forest.
26. Is it really as good as it sounds?
Why did the decision tree process need to evolve? Over-fitting was a major concern; we had to find a way to get past it.
Did we resolve that? Because growing more and more trees sounds a lot like over-fitting the model.
27. Over-fitting: Menace of Menaces
- To understand whether we have actually resolved the issue, we will need to define a few indicators useful for the analysis.
- Margin function
  - Computes the difference between the average number of votes for the right class and for the next most popular class.
  - I(.) is the indicator function used in this computation.
- General inference: the larger the margin value, the more confidence we can have in the classification.
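The margin function itself is not reproduced in this extract; the definition below is the standard one from Breiman's 2001 random forests paper, which this slide appears to follow, written in LaTeX. Here h_1, ..., h_K are the individual classifiers and I(.) is the indicator function.

mg(X, Y) = \mathrm{av}_k \, I\!\left(h_k(X) = Y\right) \;-\; \max_{j \neq Y} \mathrm{av}_k \, I\!\left(h_k(X) = j\right)

A positive margin means the ensemble votes for the right class more often than for any other class, which matches the inference above.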
28. Generalization error: this is the error in classification on a new data set. Theorem: as the number of trees keeps increasing, for all sequences the value of PE* converges (the limiting value is sketched below). This result shows that as the number of trees in a forest grows, instead of over-fitting, the generalization error of the model converges to a value, thus showing that over-fitting cannot arise simply from an increase in the number of trees.
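The limiting value referred to above is missing from this extract; assuming the slide follows Breiman (2001), the generalization error and its limit are, in LaTeX:

PE^{*} = P_{X,Y}\!\left( mg(X, Y) < 0 \right)

and, as the number of trees grows, almost surely

PE^{*} \;\to\; P_{X,Y}\!\left( P_{\Theta}\!\left(h(X, \Theta) = Y\right) - \max_{j \neq Y} P_{\Theta}\!\left(h(X, \Theta) = j\right) < 0 \right)

where \Theta is the random vector governing each individual tree's construction.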
29. Optimizing the Model
- This is one of the most important parts of any kind of modelling, where we try to make the model the best fit. The variables that can be altered are:
  - mtry: the number of attributes used for individual tree construction
  - ntree: the number of trees to be grown in the forest
- There is also the tuneRF function, which gives us the optimum value for mtry using the O.O.B error estimate as an indicator.
- As long as the OOB estimate keeps showing significant decreases, the function checks different values until it reaches an optimal value for mtry.
- As for ntree, we are already aware that the error converges to a limiting value (see the theorem discussed), so we increase the number of trees until the error stabilizes.
Key piece of information: the randomForest package in R already has default values for these, and they happen to work fine more often than not.
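A minimal sketch of tuning mtry with tuneRF from the randomForest package, using the built-in iris data as a stand-in for a real data set; the parameter values are illustrative, not recommendations.

# Search for the mtry value with the lowest OOB error estimate.
library(randomForest)

set.seed(1)
tuned <- tuneRF(
  x = iris[, -5], y = iris$Species,
  ntreeTry   = 100,   # trees grown while evaluating each candidate mtry
  stepFactor = 2,     # multiply / divide mtry by this factor at each step
  improve    = 0.05,  # keep searching only while the OOB error drops by at least 5%
  doBest     = FALSE  # return the mtry vs OOB-error table instead of a fitted forest
)
tuned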
30. Impact on Bias/Variance
- Bagging has no impact on bias.
- Bagging reduces variance.
- The question is: why?
- We know that averaging reduces variance (see the sketch below).
- Bagging, in principle, is no different from averaging, since it combines multiple classifiers to come up with a result indicative of all the classifications.
31. An Example: Predicting Customer Churn Rate
- Churn refers to the rate at which subscribers of a particular service unsubscribe from that service in a given time period.
- Random forests come in really handy for handling large chunks of transaction data.
- Studies indicate that a combination of weighted and balanced random forests, leading to Improved Balanced Random Forests (IBRFs), works quite efficiently when handling the imbalanced data found in churn prediction. (Source)
32. Key Features and Advantages
- Provides highly accurate and robust classifiers
- Able to handle large databases
- No issues with variable deletion in the case of a large number of attributes
- Useful for attribute selection, as it gives individual variable importance measures
- Extremely fast, since the entire process is parallelizable
- Cross-validation is not necessary; O.O.B estimates are quite accurate
- Resistant to over-fitting
- Handles missing values automatically and is resistant to outliers
33. Applications: Dos and Don'ts
- Do use it in cases where there is a time constraint on generating the results.
  - Reason: the entire random forest technique is completely parallelizable, as mentioned earlier, which means that all the trees are grown in parallel rather than sequentially, making the process extremely fast. Therefore, in cases where run time needs to be constrained, we opt for random forests.
- Don't use it in cases where the dataset has few data points.
  - Reason: the randomization process requires a fairly large data size and does not work efficiently otherwise.
- Don't use it in cases where the dataset contains highly uncorrelated, sparse features, as in most text analytics problems.
  - Reason: uncorrelated attributes in the individual decision trees increase their bias, rendering the entire analysis highly error-ridden.
34. Implementation in R
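This slide's code is not included in the extract; below is a minimal sketch of what the implementation could look like, assuming the randomForest package and the public UCI download URL for the Adult data. The column names follow the UCI documentation, and fnlwgt is dropped to leave the 13 explanatory variables mentioned on the next slide.

# Hedged sketch: random forest on the UCI Adult data (assumed URL and columns).
library(randomForest)

url  <- "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
cols <- c("age", "workclass", "fnlwgt", "education", "education_num",
          "marital_status", "occupation", "relationship", "race", "sex",
          "capital_gain", "capital_loss", "hours_per_week", "native_country",
          "income")
adult <- read.csv(url, header = FALSE, col.names = cols, strip.white = TRUE,
                  na.strings = "?", stringsAsFactors = TRUE)
adult$fnlwgt <- NULL   # drop the survey weight so 13 explanatory variables remain

set.seed(123)
fit <- randomForest(income ~ ., data = adult, ntree = 500,
                    importance = TRUE, na.action = na.omit)

print(fit)        # the printout includes the OOB error estimate
importance(fit)   # per-variable importance measures
varImpPlot(fit)   # visual ranking of the influential factors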
35. Case Study Summary
Aim: understand the relative importance of different factors affecting the income level of an individual.
Data source: UCI Machine Learning Repository (Adult data)
- Approach
  - There are 13 explanatory variables included in the analysis.
  - The random forest technique is used to do a predictive analysis and to decide on the influential factors.
  - I will try to validate the generated model using standard validation techniques.
- Results
  - Important factors include:
    - Age
    - Relationship
    - Capital gain
    - Education
    - Occupation
    - Marital status
    - Hours per week worked
  - OOB error: 17.3%
36. Thank You
37. A1.1 Splitting criteria
38. A1.2 Out-of-Bag (O.O.B) Error
- For each tree, the observations left out of its bootstrap sample (the complement of its train set) become a test set.
- Validation is therefore done on the complement of the training set.
- The OOB estimate is the misclassification error averaged over all samples.
39. A1.3 Confusion Matrix
- Accuracy rate = (True Positives + True Negatives) / (Total number of instances)
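As a small illustration (my own example, using the built-in iris data), a confusion matrix and the accuracy rate can be computed in R as follows:

# Confusion matrix and accuracy for a random forest's predictions.
library(randomForest)

set.seed(7)
fit  <- randomForest(Species ~ ., data = iris)
pred <- predict(fit, iris)

cm <- table(Predicted = pred, Actual = iris$Species)
cm
sum(diag(cm)) / sum(cm)   # accuracy = correctly classified / total number of instances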
40. A1.4 ID3 Algorithm
- This is a top-down, recursive, divide-and-conquer algorithm
- Splitting continues until we reach definitive output labels
- At each level, all attributes are checked to find the best one
- Information gain is a popular criterion for the selection
41. Research Papers: Findings
- A paper, "An Empirical Comparison of Supervised Learning Algorithms", which builds on the results of the STATLOG project and was published by the computer science department of Cornell University, presents the following findings (Source):
  - A comparison is made among ten popular supervised machine learning algorithms, including decision trees, random forests, SVMs, boosted trees, neural networks and logistic regression.
  - Boosted trees and random forests come out as the best classifiers across all accuracy measures and all the datasets used (11 of them).
- Another paper, "Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?", published in the Journal of Machine Learning Research, compares 179 classifiers from 17 families over 121 datasets and comes up with the following finding (Source):
  - The classifier most likely to be the best is the random forest.
- The conclusion, in my mind, is that no one could ever empirically say that a particular algorithm is the best; it depends on the problem at hand. That being said, random forests have shown some real promise when it comes to classification problems.