Skillslash - PowerPoint PPT Presentation

About This Presentation
Title:

Skillslash

Description:

This presentation gives information about machine learning techniques such as Decision Trees and Random Forests. This PPT was created by the Skillslash team.


Transcript and Presenter's Notes

Title: Skillslash


1
Outline
  • Machine Learning An Overview
  • Machine Learning Techniques
  • Decision trees
  • Random forests
  • Applications
  • Comparative analysis of tools

2
What is machine learning?
  • "Field of study that gives computers the ability to learn without being explicitly programmed" - Arthur Samuel, 1959
  • Tom Mitchell, in his famous book, defines machine learning as improving performance at a task with experience

Common Application Points
  • Product Recommendation
  • Spam Detection
  • Web Searching
3
Why machine learning?

Every 2 days we create as much information as we did from the beginning of time until 2003. (Source)
Over 90% of all the data in the world was created in the past 2 years. (Source)
It is expected that by 2020 the amount of digital information in existence will have grown from 3.2 zettabytes today to 40 zettabytes. (Source)
Even if we had the human resources to handle this surge of data, the issue of rising complexity would remain.
  • With rising complexity, knowledge discovery is getting tricky.
  • Machine learning comes to our aid here by automating the entire process.
  • It adapts to new information, which reduces the need for explicit model re-adjustment.
4
Why Machine Learning (contd.)
Number one on the FBI's most wanted list escapes prison! What do you do? You write a program to identify this man's image from security camera footage across the seven continents.
But what happens when number 2 on that list escapes? Do you put in the same kind of effort again?
Of course not! This is where machine learning works its magic. We simply train the computer to perform the task of facial recognition and it automatically extrapolates to new faces.
Wonderful, isn't it?
5
Broad Classifications

Supervised Learning
  • Known output values
  • Two basic types
  • Classification: discrete output label (e.g. risk analysis)
  • Regression: continuous output values (e.g. pricing in lemon markets)
  • A few techniques used: Random Forests, Decision Trees, Neural networks, Bayesian classification

Unsupervised Learning
  • Unknown output values
  • Minimal human expertise is available, since there is no way to distinguish between dependent and independent variables
  • A few techniques used: Clustering, Association rule analysis, Hierarchical clustering

Reinforcement Learning
  • Less commonly used
  • The computer learns on the basis of incentives
  • Incentives are generally provided in the form of rewards and punishments
  • The learner tries to find the optimal outcome by comparing the rewards gained across all of them

6
Machine Learning vs Statistical Modelling

Machine Learning
  • Focus lies primarily on the prediction outcome and its accuracy
  • Adapts in real time based on new data
  • Small datasets with many attributes can be handled

Statistical Modelling
  • Focus lies primarily on understanding how factors affect the outcome
  • New data requires further manual specification
  • Certain multidimensional cases cannot be handled (the number of attributes is greater than the number of data points)

  • In spite of the minor differences, the two ideologies are fairly similar
  • Both of them use standard statistical tests (F-tests, t-tests)
  • Machine learning is a relatively newer field and we still have a lot to learn here.

7
Decision trees: An Introduction
  • Decision trees are one of the most popular machine learning techniques, used in descriptive analysis
  • They are fairly simple and easy to use
  • Building a tree is an inductive learning task: particular facts are used to reach more general conclusions
  • A tree provides us with a set of rules that ultimately govern the final decision
  • It is a predictive model based on a branching series of Boolean tests
  • It classifies at multiple levels, unlike some other classification techniques

8
Types of Decision trees
Two Broad Classifications
  • Classification trees
  • The output label comprises discrete values
  • These values can be entered directly into the tree via the branching tests
  • Impurity measures are commonly used as the splitting criteria, e.g. Gini index, Information gain
  • Regression trees
  • The output label takes continuous real values
  • These values have to be discretized before being entered into the tree
  • Mean squared error is used as the basis for splitting
  • The mean value of the responses in the terminal node is commonly returned as the final output label

9
Decision Trees

Objective: What do I do this weekend? For something that sounds this simple, it actually isn't.
Result: The decision tree provides us with a rule, e.g. you go to play tennis when your parents are not visiting and it is sunny outside.
[Decision tree diagram, roughly: the root node asks whether the parents are visiting; Yes leads to Cinema, No leads to a Weather node that branches on Sunny (Play Tennis), Rainy (Stay in) and Windy, where a Money node separates Rich (Shopping) from Poor (Cinema). Cinema, Play Tennis, Stay in and Shopping are the terminal decision nodes.]
10
How to split?
  • There are multiple ways of deciding which attribute to choose above the others. Impurity measures are used most frequently, and three of them are listed below (a toy computation in R follows this list).
  • Information Gain
  • Uses entropy (see A1.1) as a measure of impurity
  • Used in the ID3 algorithm (see A1.4)
  • Gini index
  • Measures the divergence in the probability distributions of the target attribute's values
  • Used in the popular CART models
  • Gain ratio
  • Normalizes the information gain criterion
  • Used in the C4.5 algorithm (an evolution of the ID3 process)

Impurity refers to the dissimilarity among the elements within a node
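To make the three criteria concrete, here is a toy computation in base R. The labels and the candidate split are made up purely for illustration; they are not taken from the deck.

```r
# Toy computation of the three impurity-based criteria in base R; the labels
# and the candidate split are made up purely for illustration.
entropy <- function(y) {
  p <- prop.table(table(y))
  -sum(p * log2(p))
}
gini <- function(y) {
  p <- prop.table(table(y))
  1 - sum(p^2)
}

parent <- c("yes", "yes", "yes", "no", "no", "no", "no", "no")
left   <- parent[1:4]    # candidate split: first four cases go left
right  <- parent[5:8]    # remaining cases go right

w <- c(length(left), length(right)) / length(parent)   # branch weights
info_gain  <- entropy(parent) - sum(w * c(entropy(left), entropy(right)))
split_info <- -sum(w * log2(w))        # intrinsic information of the split
gain_ratio <- info_gain / split_info   # C4.5-style normalisation

c(information_gain = info_gain, gini_parent = gini(parent), gain_ratio = gain_ratio)
```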
11
When to stop splitting?
  • One of the following generally happens to trigger the stopping criterion:
  • All instances point to a single output label
  • The maximum tree depth is reached
  • The number of cases in the terminal node is less than the minimum number of cases specified for a parent node
  • Upon splitting, the number of cases in a child node is less than the number specified for child nodes
  • The best splitting criterion is below a specified threshold

12
Popular Decision tree inducers
  • ID3
  • Splitting criterion: Information Gain
  • C4.5
  • Splitting criterion: Gain Ratio
  • CART
  • Both classification and regression are possible
  • Splitting criteria: Gini index (classification), mean squared error (regression)
  • Strictly binary trees (i.e. each parent node splits into exactly two child nodes)
  • CHAID
  • p-values are used to check for significant differences in the target attribute
  • Continuous target attribute: F-test is used
  • Nominal target attribute: Pearson chi-squared test is used
  • Ordinal target attribute: Likelihood-ratio test is used
  • A threshold is set for the p-value, beyond which splitting stops

13
Over-fitting
  • A modeling error which occurs when a function is fit too closely to a limited set of data points. This usually happens when the model becomes over-complex and tries to explain even the noise in the training data. It is one of the major problems faced in the construction of decision trees.
  • Consider the error of a hypothesis h over
  • the training data: error_train(h)
  • the entire distribution D of the data: error_D(h)
  • The hypothesis h over-fits the training data if there exists an alternative hypothesis h1 such that both of the conditions below hold:

error_train(h) < error_train(h1)  and  error_D(h) > error_D(h1)
14
How do we tackle this problem?
  • The answer lies in the method of pruning: a tree of the maximum possible depth is first created and then we start hacking off branches from the terminal nodes, working our way up. This methodology is called post-pruning.
  • The obvious question that arises next is: why grow a tree and then prune it, rather than simply not growing it beyond a certain point?
  • Fixing a threshold, say a certain reduction in misclassification error being mandatory for a split, may prevent useful subsequent splits from happening altogether, which in turn would negatively impact the predictive capabilities of the model.
15
To have a clearer picture we need to acquaint ourselves with a few mathematical measures. Let R(T) represent the misclassification error of a tree T and |T| represent the number of leaves in the tree. We now define a cost-complexity measure R_a(T), in which a weights the complexity term (i.e. as the number of leaves increases, so does this measure):

R_a(T) = R(T) + a|T|

The principle behind the pruning process is to reduce the value of the cost-complexity measure, which ensures that the tree is simple while at the same time sufficiently accurate and generalizable (a sketch of how this is done with the rpart package follows below). Even from a business perspective, high depth is not favorable for any analysis, owing to growing complexity.
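As a rough sketch of how this is done in practice: R's rpart package (a CART-style implementation) exposes the complexity parameter cp, so a deep tree can be grown and then pruned back to the cp value with the lowest cross-validated error. The built-in iris data and the control settings below are illustrative choices, not part of the deck.

```r
# Cost-complexity pruning sketch with rpart (a CART-style implementation);
# iris and the control settings are illustrative choices.
library(rpart)
set.seed(1)

fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(cp = 0, minsplit = 2))   # grow a deep tree

printcp(fit)   # cp table: complexity vs cross-validated error for each subtree

# Prune back to the complexity value with the lowest cross-validated error
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
```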
16
T1: Tree of maximum depth
T2: Hack off the branches of the lowest node in T1 to get this
T3: Hack off the lowest node from T2 to get this
Now that all the subtrees have been obtained, we pass the test data through these trees to obtain the one with the lowest cost-complexity measure. That is the best and final tree.
17
Telecom Example

Objective: To understand consumer behavior in order to create customer-specific promotional offers
[Figure: Decision Tree for Mobile Customers]
Result: The company should incentivize a good customer to purchase more connections, so as to transform him into a great customer
18
Decision trees vs Logistic Regression

[Figure: the logistic regression classifier splits the space with one oblique boundary, while the decision tree classifier uses two distinct boundaries]
  • Both of them are fairly fast; there is no distinguishing them here
  • Logistic regression will work better if there is a single decision boundary, especially if it is oblique to the axes
  • Decision trees work better on more complex datasets which do not have one fixed underlying decision boundary but multiple such boundaries
  • Higher scalability makes decision trees more prone to over-fitting

19
Decision Trees
  • Advantages
  • Easy to use
  • Starts returning valuable information really fast
  • High scalability
  • Limitations
  • Highly sensitive to new information
  • Susceptible to over-fitting
  • Low accuracy
  • How do we get past these limitations?

20
Ensemble Methods
  • Ensemble learning is a machine learning paradigm
    where we employ several individual or base
    learners to work on the same problem
  • Individual learners are classified on the basis
    of predictive accuracy as Strong and Weak
    learners
  • We then use some aggregation methodology to
    improve upon the individual learners
  • Popular ensemble techniques include Bagging and
    Boosting.

21
Bagging
  • One of the popular ensemble methods; the name stands for Bootstrap Aggregation
  • Steps outlining the process (a hand-rolled sketch follows below):
  • 1. Random sampling with replacement (bootstrapping)
  • 2. Implementing a learning algorithm on all sub-samples
  • 3. Aggregation of the predictions made by the individual classifiers

[Figure: the training data is bootstrapped into Train Sample 1, Train Sample 2, Train Sample 3, and so on; a classifier is trained on each sample and the predictions are aggregated]
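A hand-rolled sketch of the three steps, using rpart trees as the base learners. The iris data, the train/test split and the number of bootstrap samples are all illustrative assumptions.

```r
# Hand-rolled bagging sketch with rpart base learners; iris, the split and the
# number of bootstrap samples are illustrative assumptions.
library(rpart)
set.seed(1)

train_idx <- sample(nrow(iris), 100)
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

B <- 25
preds <- sapply(seq_len(B), function(b) {
  boot <- train[sample(nrow(train), replace = TRUE), ]         # 1. bootstrap
  tree <- rpart(Species ~ ., data = boot, method = "class")    # 2. base learner
  as.character(predict(tree, test, type = "class"))
})

# 3. aggregate: majority vote across the B trees for each held-out row
vote <- apply(preds, 1, function(p) names(which.max(table(p))))
mean(vote == test$Species)   # accuracy of the bagged ensemble
```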
22
Random Forests
  • A random forest is an ensemble classifier that consists of many decision trees and outputs the class that is the mode of the classes output by the individual trees
  • Purpose
  • Improve prediction accuracy above that of individual decision trees
  • Principle
  • Increase diversity among the individual decision trees; we ideally want the trees to be uncorrelated
  • Solution
  • Randomness in sampling the training data, ensured by bootstrapping
  • Random selection of the features to be used in each tree
  • The number of features used per tree is a small fraction of the total number; by default, the square root of the total number of features is used (a minimal fit is sketched below)
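A minimal fit with the randomForest package, using iris as a stand-in dataset; for classification the package already defaults mtry to the square root of the number of features, and the call below only spells that out.

```r
# Minimal random forest fit on a stand-in dataset (iris). For classification,
# randomForest defaults mtry to sqrt(number of features); it is spelled out
# here only to illustrate the point made above.
library(randomForest)
set.seed(1)

p  <- ncol(iris) - 1                       # 4 predictors, Species is the label
rf <- randomForest(Species ~ ., data = iris,
                   ntree = 500, mtry = floor(sqrt(p)))   # mtry = 2
print(rf)   # reports the OOB error estimate and the confusion matrix
```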

23
Algorithm
This is a three-step, fully parallelizable process:
  • Bootstrapping
  • Forest generation
  • Aggregation
24
Algorithm Flowchart

[Flowchart: Begin; for each tree, choose a training data subset (sample the data) and a variable subset; for each chosen variable, sort by the variable and compute the Gini index at each split point; choose the best split and build the next split, repeating until the stop condition holds at each node; then calculate the prediction error; End.]
25
How do Random Forests work?
  • The forest has been generated and we seek to understand how a prediction is made for some new input vector (a short sketch follows this list)
  • The input vector is pushed through all the trees in the forest
  • It ends up in one of the several terminal nodes in each of these trees
  • Now we need to decide what each of those terminal nodes represents in terms of its output label
  • Case 1: The terminal node is pure (i.e. it contains data points with all identical output labels). In this case the choice is simple, as there is only one output label to choose from, and we choose that.
  • Case 2: The terminal node is impure. In such cases the mode output label is chosen as the predicted output for all elements in that terminal node.
  • Finally, the mode of all the predictions made by the individual trees is chosen as the final prediction of the random forest.
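To see the voting mechanism directly, the sketch below pushes a few rows through every tree of the forest fitted earlier and compares the per-tree votes with the aggregated (mode) prediction. It assumes the rf object from the previous sketch and uses predict's predict.all argument.

```r
# Pushing a few rows through every tree of the forest fitted above (assumes
# the `rf` object and library(randomForest) from the previous sketch).
# predict.all = TRUE returns each individual tree's vote alongside the
# aggregated prediction.
new_obs <- iris[c(1, 51, 101), -5]          # three illustrative observations
out <- predict(rf, new_obs, predict.all = TRUE)

out$individual[, 1:5]   # votes cast by the first five trees
out$aggregate           # mode of all per-tree votes = the forest's prediction
```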

26
Is it really as good as it sounds?
Why did the decision tree process need to evolve? Over-fitting was a major concern, and we had to find a way to get past it.
Did we resolve that? Because growing more and more trees sounds a lot like over-fitting the model.
27
Over-fitting: Menace of Menaces
  • To understand whether we have actually resolved the issue, we will need to define a few indicators useful for the analysis.
  • Margin function
  • It computes the difference between the average number of votes for the right class and that for the next most popular class.
  • I(.) is the indicator function used in the above computation
  • General inference: the larger the margin value, the more confidence we can have in the classification

28
Generalization error: this is the error in classification on a new data-set. Theorem: as the number of trees keeps increasing, for almost all sequences the value of PE converges to a limiting value.
This result shows that as the number of trees in a forest grows, instead of over-fitting, the generalization error of the model converges to a value, thus proving that over-fitting is not possible due to an increase in the number of trees.
29
Optimizing the Model
  • This is one of the most important parts of doing any kind of modelling, where we try to make the model the best fit. The variables that can be altered are:
  • mtry: the number of attributes used for the construction of each individual tree
  • ntree: the number of trees to be grown in the forest
  • There is also the tuneRF function, which gives us the optimum value for mtry using the O.O.B error estimate as an indicator.
  • As long as the OOB estimates keep showing significant decreases, the function checks different values until it reaches an optimal value for mtry.
  • As for ntree, we are already aware that the error converges to a limiting value (the theorem discussed earlier), so we increase the number of trees until the error stabilizes (a tuning sketch follows below).

Key piece of information: the randomForest package in R already has default values for these parameters, and they happen to work fine more often than not.
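A tuning sketch along these lines, again using iris as a stand-in; the stepFactor, improve and ntree values are arbitrary illustrative choices.

```r
# Tuning sketch: tuneRF searches over mtry using the OOB error estimate, and
# the per-tree OOB error curve shows where adding trees stops helping.
# iris is again only a stand-in; the tuning parameters are arbitrary.
library(randomForest)
set.seed(1)

x <- iris[, -5]
y <- iris$Species

tuned <- tuneRF(x, y, stepFactor = 1.5, improve = 0.01,
                ntreeTry = 500, trace = TRUE)    # OOB error for each mtry tried

rf <- randomForest(x, y, ntree = 1000)
plot(rf$err.rate[, "OOB"], type = "l",
     xlab = "number of trees", ylab = "OOB error")   # error flattens out
```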
30
Impact on Bias/Variance
  • Bagging has no impact on bias
  • Bagging reduces variance
  • The question is: why?
  • We know that averaging reduces variance: the variance of the average of B independent estimates is roughly 1/B times the variance of a single estimate (a toy simulation follows below)
  • Bagging is in principle no different from averaging, since it combines multiple classifiers to come up with a result indicative of all the classifications
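A toy simulation (with made-up numbers) of the averaging argument: the mean of the averaged estimator stays the same, while its variance shrinks roughly by a factor of B. Bagged trees are correlated rather than independent, so in practice the reduction is smaller than 1/B.

```r
# Toy simulation: averaging B noisy estimates leaves the mean (bias) unchanged
# but shrinks the variance, here roughly by a factor of B because the
# simulated estimates are independent. The numbers are made up.
set.seed(1)
B <- 25
single   <- replicate(10000, rnorm(1, mean = 5, sd = 2))          # one learner
averaged <- replicate(10000, mean(rnorm(B, mean = 5, sd = 2)))    # average of B

c(mean_single = mean(single), mean_averaged = mean(averaged))   # both near 5
c(var_single  = var(single),  var_averaged  = var(averaged))    # ~4 vs ~4/25
```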

31
An Example: Predicting Customer Churn Rate
  • Churn refers to the rate at which subscribers of a particular service unsubscribe from that service in a given time period.
  • Random forests come in really handy when handling large chunks of transaction data.
  • Studies indicate that a combination of weighted and balanced random forests, which leads to Improved Balanced Random Forests (IBRFs), works quite efficiently when handling imbalanced data in churn prediction. (Source)

32
Key Features and Advantages
  • Provides highly accurate and robust classifiers
  • Ability to handle large databases
  • No issues of variable deletion in the case of a large number of attributes
  • Useful for attribute selection, as it gives individual variable importance measures
  • Since the entire process is parallelizable, it is extremely fast
  • Cross-validation is not necessary; O.O.B estimates are quite accurate
  • Resistant to over-fitting
  • Handles missing values automatically and is resistant to outliers

33
Applications: Dos and Don'ts
  • Do use it in cases where there is a time constraint involved in generating the results
  • Reason: the entire random forest technique is completely parallelizable, as mentioned earlier, which means that all the trees are grown in parallel rather than sequentially, making the process amazingly fast. Therefore, in cases where run time needs to be constrained, we opt for random forests.
  • Don't use it in cases where the dataset has few data points
  • Reason: the process of randomization requires a fairly large dataset and does not work efficiently otherwise
  • Don't use it in cases where the dataset contains highly uncorrelated sparse features, as is the case in most text analytics applications
  • Reason: uncorrelated attributes in the individual decision trees increase the bias of those trees, thereby rendering the entire analysis highly error-ridden

34
Implementation in R
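The code shown on this slide did not survive the transcript, so the block below is only a hedged reconstruction of what a typical randomForest workflow on the Adult data might look like; the file name, column handling and parameter values are assumptions, not the original code.

```r
# Hedged reconstruction of a typical randomForest workflow on the Adult data;
# the original slide's code was not preserved, so the file name, column
# handling and parameter values below are assumptions.
library(randomForest)
set.seed(1)

adult <- read.csv("adult.data", header = FALSE, strip.white = TRUE,
                  na.strings = "?", stringsAsFactors = TRUE)
names(adult)[ncol(adult)] <- "income"     # last column holds the income label
adult <- na.omit(adult)

rf <- randomForest(income ~ ., data = adult, ntree = 500, importance = TRUE)

print(rf)        # OOB error estimate and confusion matrix
varImpPlot(rf)   # relative importance of the explanatory variables
```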

35
Case Study Summary

Aim: Understand the relative importance of different factors affecting the income level of an individual
Data source: UCI Machine Learning Repository (Adult.data)
  • Approach
  • There are 13 explanatory variables included in the analysis.
  • The Random Forest technique is used to do a predictive analysis and to decide on the influential factors.
  • I will try to validate the generated model using standard validation techniques.
  • Results
  • Important factors include
  • Age
  • Relationship
  • Capital gain
  • Education
  • Occupation
  • Marital status
  • Hours worked per week
  • OOB error: 17.3%
36
THANK YOU
37
A1.1 Splitting criteria

38
A1.2 Out-of-Bag (O.O.B) Error

[Figure: each tree is trained on a bootstrap sample (train set); its complement serves as the test set]
  • The observations left out of each bootstrap sample (its complement) become a test set for the corresponding tree
  • The OOB estimate is the misclassification error averaged over all samples
  • Validation is thus done on the complement of the training set (see the snippet below)
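A short snippet showing where these quantities live on a fitted forest; it assumes a classification forest rf such as the ones fitted in the earlier sketches.

```r
# Where the OOB quantities live on a fitted classification forest (assumes an
# `rf` object such as the ones fitted in the earlier sketches).
tail(rf$err.rate[, "OOB"], 1)   # final OOB misclassification estimate
rf$confusion                    # OOB-based confusion matrix, with class errors
```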

39
A1.3 Confusion Matrix
  • Accuracy rate = (True Positives + True Negatives) / (Total number of instances), as computed in the snippet below
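A minimal computation of the confusion matrix and the accuracy rate on held-out rows, using iris as a stand-in dataset (assumes the randomForest package is loaded as in the earlier sketches).

```r
# Confusion matrix and accuracy rate on held-out rows; iris and the split are
# illustrative (assumes library(randomForest) is loaded as before).
set.seed(1)
idx  <- sample(nrow(iris), 100)
rf   <- randomForest(Species ~ ., data = iris[idx, ])
pred <- predict(rf, iris[-idx, ])

cm <- table(predicted = pred, actual = iris$Species[-idx])
cm
sum(diag(cm)) / sum(cm)   # correctly classified instances / total instances
```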

40
A1.4 ID3 Algorithm
  • This is a top-down, recursive, divide-and-conquer algorithm
  • Splitting continues until we reach definitive output labels
  • At each level, all attributes are checked to find the best one
  • Information Gain is a popular criterion for the selection (a compact sketch follows below)
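A compact, illustrative ID3-style sketch in base R, assuming categorical predictors and a factor target; it is meant only to mirror the steps listed above, not to be a production implementation.

```r
# Compact ID3-style sketch in base R, assuming categorical predictors and a
# factor target; illustrative only, mirroring the steps listed above.
entropy <- function(y) {
  p <- prop.table(table(y))
  -sum(p * log2(p))
}

id3 <- function(data, target, attrs) {
  y <- data[[target]]
  # Stop: pure node or no attributes left -> return the majority label
  if (length(unique(y)) == 1 || length(attrs) == 0) {
    return(names(which.max(table(y))))
  }
  # Check every remaining attribute and keep the one with the best gain
  gains <- sapply(attrs, function(a) {
    splits <- split(y, data[[a]])
    entropy(y) - sum(sapply(splits, function(s) length(s) / length(y) * entropy(s)))
  })
  best <- attrs[which.max(gains)]
  # Recurse on each observed value of the chosen attribute
  branches <- lapply(split(data, data[[best]], drop = TRUE), function(d)
    id3(d, target, setdiff(attrs, best)))
  list(attribute = best, branches = branches)
}

# Hypothetical usage: id3(df, target = "label", attrs = setdiff(names(df), "label"))
```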

41
Research Papers: Findings
  • A paper, "An Empirical Comparison of Supervised Learning Algorithms", which builds on the results of the STATLOG project and was published by the Computer Science department of Cornell University, presents the following findings (Source):
  • A comparison is made among ten popular supervised machine learning algorithms, including decision trees, random forests, SVMs, boosted trees, neural networks and logistic regression
  • Boosted trees and random forests come out as the best classifiers across all accuracy measures and all the datasets used (11 in total)
  • Another paper, "Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?", published in the Journal of Machine Learning Research, compares 179 classifiers arising from 17 families over 121 datasets and comes up with the following finding (Source):
  • The classifier most likely to be the best is the random forest
  • The conclusion, in my mind, is that no one could ever empirically say that a particular algorithm is the best; it depends on the problem at hand. That being said, random forests have shown some real promise when it comes to classification problems.