Title: Skillslash
1. Outline
- Machine Learning: An Overview
- Machine Learning Techniques
- Decision Trees
- Random Forests
- Applications
- Comparative Analysis of Tools
2. What is machine learning?
- "Field of study that gives computers the ability to learn without being explicitly programmed." - Arthur Samuel, 1959
- Tom Mitchell, in his famous book, defines machine learning as improving performance at a task with experience.
Common application points:
- Product recommendation
- Spam detection
- Web search
3. Why machine learning?
- Every 2 days we create as much information as we did from the beginning of time until 2003. (Source)
- Over 90% of all the data in the world was created in the past 2 years. (Source)
- It is expected that by 2020 the amount of digital information in existence will have grown from 3.2 zettabytes today to 40 zettabytes. (Source)
- Even if we had the human resources to handle this splurge of data, the issue of rising complexity would remain.
- With rising complexity, knowledge discovery is getting tricky.
- Machine learning comes to our aid here by automating this entire process.
- It adapts to new information, which reduces the need for explicit model re-adjustment.
4. Why machine learning? (contd.)
Number one on the FBI's most wanted list escapes prison! What do you do? You write a program to identify this man's image from security camera footage across the seven continents.
But what happens when number 2 on that list escapes? Do you put in the same kind of effort again?
Of course not! This is where machine learning works its magic. We simply train the computer to perform the task of facial recognition and it automatically extrapolates to new faces. Wonderful, isn't it?
5. Broad Classifications
Supervised Learning
- Known output values
- Two basic types
  - Classification: discrete output labels (e.g. risk analysis)
  - Regression: continuous output values (e.g. pricing in lemon markets)
- A few techniques used: random forests, decision trees, neural networks, Bayesian classification
Unsupervised Learning
- Unknown output values
- Minimal human expertise available, since there is no way to distinguish between dependent and independent variables
- A few techniques used: clustering, association rule analysis, hierarchical clustering
Reinforcement Learning
- Less commonly used
- The computer learns on the basis of incentives
- Incentives are generally provided in the form of rewards and punishments
- The learner tries to find the optimal outcome by comparing the rewards gained across all of them
6. Machine Learning vs Statistical Modelling
Machine Learning
- Focus lies primarily on the prediction outcome and accuracy
- Adapts in real time based on new data
- Small datasets with many attributes can be handled
Statistical Modelling
- Focus lies primarily on understanding how factors affect the outcome
- New data requires further manual specification
- Certain high-dimensional cases cannot be handled (number of attributes greater than number of data points)
In spite of the minor differences, the two ideologies are fairly similar:
- Both of them use standard statistical tests (F-tests, t-tests)
- Machine learning is a relatively newer field and we still have a lot to learn here
7. Decision Trees: An Introduction
- Decision trees are one of the most popular machine learning techniques used in descriptive analysis
- They are fairly simple and easy to use
- Building a tree is an inductive learning task: it uses particular facts to make more generalized conclusions
- A tree provides us with a set of rules that ultimately govern the final decision
- It is a predictive model based on a branching series of Boolean tests
- It classifies at multiple levels, unlike some other classification techniques
8. Types of Decision Trees
Two broad classifications:
- Classification trees
  - The output label comprises discrete values
  - These values can be entered directly into the tree's branching tests
  - Impurity measures are commonly used as the splitting criteria, e.g. the Gini index and information gain
- Regression trees
  - The output label takes continuous real values
  - These values have to be discretized before being entered into the tree
  - Mean squared error is used as the basis for splitting
  - Commonly returns the mean value of the class as the final output label
9. Decision Trees
Objective: What do I do this weekend? For something that sounds this simple, it actually isn't.
(Figure: a decision tree with root node "Parents visiting?". Yes leads to Cinema; No leads to a Weather node. Sunny leads to Play Tennis, Rainy leads to Stay In, and Windy leads to a Money node, where Rich leads to Shopping and Poor leads to Cinema. Cinema, Play Tennis, Stay In and Shopping are terminal decision nodes.)
Result: The decision tree provides us with a rule, e.g. you go to play tennis when your parents are not visiting and when it's sunny outside.
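To make the rule-extraction idea concrete, here is a minimal sketch in R using the rpart package (an implementation of CART, discussed later). The toy data frame is invented purely to mirror the tree in the figure, so the exact rows are an assumption.

# Minimal sketch: a toy "what do I do this weekend?" tree with rpart.
# The data below is invented to mirror the figure on this slide.
library(rpart)

weekend <- data.frame(
  parents  = c("yes", "yes", "no", "no", "no", "no", "no", "no", "no", "no"),
  weather  = c("sunny", "rainy", "sunny", "sunny", "rainy", "windy", "windy", "sunny", "rainy", "windy"),
  money    = c("rich", "rich", "rich", "poor", "poor", "rich", "poor", "rich", "rich", "poor"),
  decision = c("cinema", "cinema", "tennis", "tennis", "stay in", "shopping", "cinema", "tennis", "stay in", "cinema"),
  stringsAsFactors = TRUE
)

# The defaults (minsplit = 20) are relaxed because this toy data set is tiny.
fit <- rpart(decision ~ parents + weather + money, data = weekend,
             method = "class", control = rpart.control(minsplit = 2, cp = 0))

print(fit)  # prints the branching rules, e.g. parents = no & weather = sunny -> tennis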
10. How to split?
- There are multiple ways of deciding which attribute to choose above the others. Impurity measures are used most frequently, and three of them are listed below (a small sketch in R follows at the end of this slide).
- Information gain
  - Uses entropy as the measure of impurity
  - Used in the ID3 algorithm (see appendix A1.4)
- Gini index
  - Measures the divergence between the probability distributions of the target attribute's values
  - Used in the popular CART models
- Gain ratio
  - Normalizes the information gain criterion
  - Used in the C4.5 algorithm (an evolution of the ID3 process)
Impurity refers to the dissimilarity of the elements within a node.
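For concreteness, here is a small illustration in R of the impurity measures named above; the helper-function names are my own, not from any package.

# Entropy and Gini impurity for a vector of class labels, plus the
# information gain obtained from splitting on a grouping variable.
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]                    # drop empty classes so 0 * log(0) never occurs
  -sum(p * log2(p))
}

gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

info_gain <- function(labels, groups) {
  weights <- table(groups) / length(groups)
  child_entropy <- sapply(split(labels, groups), entropy)
  entropy(labels) - sum(weights * child_entropy)
}

# Toy usage: how much does knowing the weather tell us about the decision?
decision <- c("tennis", "tennis", "cinema", "stay in", "cinema")
weather  <- c("sunny", "sunny", "rainy", "rainy", "windy")
info_gain(decision, weather)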
11. When to stop splitting?
- One of the following generally happens to trigger the stopping criterion:
  - All instances point to a single output label
  - The maximum tree depth is reached
  - The number of cases in the terminal node is less than the minimum number of cases specified for a parent node
  - Upon splitting, the number of cases in a child node would be less than the minimum number specified for child nodes
  - The best splitting criterion is below a specified threshold
- These criteria map directly onto tuning parameters in common implementations; see the sketch below.
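As a concrete illustration (an assumption, since the slides do not name a specific implementation), these stopping criteria correspond to control parameters in R's rpart package:

# Stopping criteria expressed as rpart control parameters.
library(rpart)

ctrl <- rpart.control(
  maxdepth  = 5,     # the maximum tree depth is reached
  minsplit  = 20,    # minimum number of cases a (parent) node must have to be split
  minbucket = 7,     # minimum number of cases allowed in a child / terminal node
  cp        = 0.01   # a split must improve the overall fit by at least cp, or splitting stops
)

# fit <- rpart(label ~ ., data = training_data, method = "class", control = ctrl)
# ("label" and "training_data" above are placeholders, not real objects.)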
12. Popular Decision Tree Inducers
- ID3
  - Splitting criterion: information gain
- C4.5
  - Splitting criterion: gain ratio
- CART
  - Both classification and regression are possible
  - Splitting criterion: Gini index (classification), mean squared error (regression)
  - Strictly binary trees (i.e. each parent node splits into exactly two child nodes)
- CHAID
  - p-values are used to check for significant differences in the target attribute
  - Continuous target attribute: F-test is used
  - Nominal target attribute: Pearson chi-squared test is used
  - Ordinal target attribute: likelihood-ratio test is used
  - A threshold is set for the p-value beyond which splitting stops
13. Over-fitting
- A modeling error which occurs when a function fits a limited set of data points too closely. This usually happens when the model becomes over-complex and tries to explain even the noise in the training data. It is one of the major problems faced in the construction of decision trees.
- Consider the error of a hypothesis h over
  - the training data: error_train(h)
  - the entire distribution D of the data: error_D(h)
- Hypothesis h over-fits the training data if there exists an alternative hypothesis h' such that both of the conditions below hold:
  error_train(h) < error_train(h')  and  error_D(h) > error_D(h')
14. How do we tackle this problem?
The answer lies in the method of pruning.
- A tree of the maximum possible depth is first created, and then we start hacking off branches from the terminal nodes and work our way up. This methodology is called post-pruning.
- The obvious question that arises next is: why grow a tree and then prune it, rather than simply not growing it beyond a certain point?
- Fixing a threshold that requires a certain reduction in misclassification error for every split may prevent useful subsequent splits from happening altogether, which in turn would hurt the predictive capabilities of the model.
15. To have a clearer picture, we need to acquaint ourselves with a few mathematical measures. Let R(T) represent the misclassification error of a tree T and |T| the number of leaves in the tree. We now define a cost-complexity measure R_a(T), where a is the complexity parameter (i.e. as the number of leaves increases, so does this measure):

R_a(T) = R(T) + a|T|

The principle behind the pruning process is to reduce the value of this cost-complexity measure, which ensures that the tree is simple while at the same time sufficiently accurate and generalizable. Even from a business perspective, high depth is not favorable for any analysis, owing to growing complexity.
16. The pruning sequence:
- T1: the tree of maximum depth
- T2: hack off the branches of the lowest node in T1 to get this
- T3: hack off the lowest node from T2 to get this
Now that all the subtrees have been obtained, we pass the test data through these trees to obtain the one with the lowest cost-complexity measure. That is the best and final tree.
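A minimal sketch of this post-pruning workflow in R with rpart, whose complexity parameter cp plays the role of a in R_a(T) = R(T) + a|T|. Note that rpart scores the subtree sequence with cross-validated error rather than a separate test set, but the idea is the same.

# Grow a deliberately deep tree on rpart's built-in kyphosis data, then prune it.
library(rpart)

set.seed(1)
full_tree <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                   method = "class",
                   control = rpart.control(cp = 0, minsplit = 2))

printcp(full_tree)   # the sequence of subtrees with their cross-validated errors

# Keep the subtree whose cross-validated error is lowest.
best_cp <- full_tree$cptable[which.min(full_tree$cptable[, "xerror"]), "CP"]
pruned_tree <- prune(full_tree, cp = best_cp)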
17. Telecom Example
Objective: understand consumer behavior in order to create customer-specific promotional offers.
(Figure: decision tree for mobile customers.)
Result: the company should incentivize a good customer to purchase more connections, to transform him into a great customer.
18. Decision Trees vs Logistic Regression
(Figure: two classifiers on the same data. The logistic regression classifier uses one oblique boundary to split; the decision tree classifier uses two distinct boundaries.)
- Both of them are fairly fast, and there is no distinguishing them here.
- Logistic regression will work better if there is a single decision boundary, especially if it is oblique to the axes.
- Decision trees work better on more complex datasets which do not have one fixed underlying decision boundary but several such boundaries.
- Higher scalability makes decision trees more prone to over-fitting.
19. Decision Trees
- Advantages
  - Easy to use
  - Starts returning valuable information really fast
  - High scalability
- Limitations
  - Highly sensitive to new information
  - Susceptible to over-fitting
  - Low accuracy
- How do we get past these limitations?
20. Ensemble Methods
- Ensemble learning is a machine learning paradigm where we employ several individual or "base" learners to work on the same problem.
- Individual learners are classified on the basis of predictive accuracy as strong and weak learners.
- We then use some aggregation methodology to improve upon the individual learners.
- Popular ensemble techniques include bagging and boosting.
21. Bagging
- One of the popular ensemble methods
- Stands for Bootstrap Aggregation
Steps outlining the process:
1. Random sampling with replacement (producing train sample 1, train sample 2, train sample 3, ...)
2. Implementing a learning algorithm on all sub-samples
3. Aggregating the predictions of the individual classifiers
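The three steps can be sketched in a few lines of R, using rpart trees as the base learners and the built-in iris data as a stand-in; this is an illustration of the idea rather than a production bagging routine.

# A minimal sketch of bagging with rpart trees on the iris data.
library(rpart)

set.seed(42)
n_bags <- 25
n <- nrow(iris)

# Steps 1 and 2: draw bootstrap samples and fit a tree on each.
trees <- lapply(seq_len(n_bags), function(b) {
  idx <- sample(n, n, replace = TRUE)            # random sampling with replacement
  rpart(Species ~ ., data = iris[idx, ], method = "class")
})

# Step 3: aggregate the individual predictions by majority vote.
votes <- sapply(trees, function(t) as.character(predict(t, iris, type = "class")))
bagged_pred <- apply(votes, 1, function(row) names(which.max(table(row))))

mean(bagged_pred == iris$Species)  # training accuracy of the bagged ensemble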
22. Random Forests
- A random forest is an ensemble classifier that consists of many decision trees and outputs the class that is the mode of the classes output by the individual trees.
- Purpose
  - Improve prediction accuracy above that of single decision trees
- Principle
  - Increase diversity among the individual decision trees; we ideally want the trees to be uncorrelated
- Solution
  - Randomness in sampling the training data, ensured by bootstrapping
  - Random selection of the features to be used in each tree
  - The number of features used per tree is a small fraction of the total number; by default, the square root of the total number of features is used
23. Algorithm
This is a three-step, fully parallelizable process:
1. Bootstrapping
2. Forest generation
3. Aggregation
24. Algorithm Flowchart
(Flowchart: begin; for each tree, sample a training data subset and choose a variable subset; for each chosen variable, sort by the variable and compute the Gini index at each split point; choose the best split and build the next split; repeat until the stop condition holds at each node; then calculate the prediction error; end.)
25. How do Random Forests work?
- The forest has been generated, and we seek to understand how a prediction is made on some new input vector.
- The input vector is pushed through all the trees in the forest.
- It ends up in one of the several terminal nodes in each of these trees.
- Now we need to decide what each of those terminal nodes represents in terms of its output label.
  - Case 1: the terminal node is pure (i.e. it contains data points with all-similar output labels). In this case the choice is simple, as there is only one output label to choose from, and we choose that.
  - Case 2: the terminal node is impure. In such cases the mode output label is chosen as the predicted output for all elements in that terminal node.
- Finally, the mode of all the predictions made by the individual trees is chosen as the final prediction of the random forest.
26. Is it really as good as it sounds?
Why did the decision tree process need to evolve? Over-fitting was a major concern; we had to find a way to get past it.
Did we resolve that? Because growing more and more trees sounds a lot like over-fitting the model.
27. Over-fitting: Menace of Menaces
- To understand whether we have actually resolved the issue, we will need to define a few indicators useful for the analysis.
- Margin function
  - Computes the difference between the average number of votes for the right class and for the next most popular class.
  - I(.) is the indicator function used in this computation.
- General inference: the larger the margin value, the more confidence we can have in the classification.
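The margin function itself is not reproduced in this extract; the definition below is the standard one from Breiman's 2001 random forests paper, which this slide appears to follow, written in LaTeX. Here h_1, ..., h_K are the individual classifiers and I(.) is the indicator function.

mg(X, Y) = \mathrm{av}_k \, I\!\left(h_k(X) = Y\right) \;-\; \max_{j \neq Y} \mathrm{av}_k \, I\!\left(h_k(X) = j\right)

A positive margin means the ensemble votes for the right class more often than for any other class, which matches the inference above.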
28. Generalization error: this is the error in classification on a new data set. Theorem: as the number of trees keeps increasing, for all sequences the value of PE* converges (the limiting value is sketched below). This result shows that as the number of trees in a forest grows, instead of over-fitting, the generalization error of the model converges to a value, thus showing that over-fitting cannot arise simply from an increase in the number of trees.
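The limiting value referred to above is missing from this extract; assuming the slide follows Breiman (2001), the generalization error and its limit are, in LaTeX:

PE^{*} = P_{X,Y}\!\left( mg(X, Y) < 0 \right)

and, as the number of trees grows, almost surely

PE^{*} \;\to\; P_{X,Y}\!\left( P_{\Theta}\!\left(h(X, \Theta) = Y\right) - \max_{j \neq Y} P_{\Theta}\!\left(h(X, \Theta) = j\right) < 0 \right)

where \Theta is the random vector governing each individual tree's construction.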
29. Optimizing the Model
- This is one of the most important parts of any kind of modelling, where we try to make the model the best fit. The variables that can be altered are:
  - mtry: the number of attributes used for individual tree construction
  - ntree: the number of trees to be grown in the forest
- There is also the tuneRF function, which gives us the optimum value for mtry using the O.O.B error estimate as an indicator.
- As long as the OOB estimate keeps showing significant decreases, the function checks different values until it reaches an optimal value for mtry.
- As for ntree, we are already aware that the error converges to a limiting value (see the theorem discussed), so we increase the number of trees until the error stabilizes.
Key piece of information: the randomForest package in R already has default values for these, and they happen to work fine more often than not.
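A minimal sketch of tuning mtry with tuneRF from the randomForest package, using the built-in iris data as a stand-in for a real data set; the parameter values are illustrative, not recommendations.

# Search for the mtry value with the lowest OOB error estimate.
library(randomForest)

set.seed(1)
tuned <- tuneRF(
  x = iris[, -5], y = iris$Species,
  ntreeTry   = 100,   # trees grown while evaluating each candidate mtry
  stepFactor = 2,     # multiply / divide mtry by this factor at each step
  improve    = 0.05,  # keep searching only while the OOB error drops by at least 5%
  doBest     = FALSE  # return the mtry vs OOB-error table instead of a fitted forest
)
tuned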
30. Impact on Bias/Variance
- Bagging has no impact on bias.
- Bagging reduces variance.
- The question is: why?
- We know that averaging reduces variance (see the sketch below).
- Bagging, in principle, is no different from averaging, since it combines multiple classifiers to come up with a result indicative of all the classifications.
31. An Example: Predicting Customer Churn Rate
- Churn refers to the rate at which subscribers of a particular service unsubscribe from that service in a given time period.
- Random forests come in really handy for handling large chunks of transaction data.
- Studies indicate that a combination of weighted and balanced random forests, leading to Improved Balanced Random Forests (IBRFs), works quite efficiently when handling the imbalanced data found in churn prediction. (Source)
32. Key Features and Advantages
- Provides highly accurate and robust classifiers
- Able to handle large databases
- No issues with variable deletion in the case of a large number of attributes
- Useful for attribute selection, as it gives individual variable importance measures
- Extremely fast, since the entire process is parallelizable
- Cross-validation is not necessary; O.O.B estimates are quite accurate
- Resistant to over-fitting
- Handles missing values automatically and is resistant to outliers
33. Applications: Dos and Don'ts
- Do use it in cases where there is a time constraint on generating the results.
  - Reason: the entire random forest technique is completely parallelizable, as mentioned earlier, which means that all the trees are grown in parallel rather than sequentially, making the process extremely fast. Therefore, in cases where run time needs to be constrained, we opt for random forests.
- Don't use it in cases where the dataset has few data points.
  - Reason: the randomization process requires a fairly large data size and does not work efficiently otherwise.
- Don't use it in cases where the dataset contains highly uncorrelated, sparse features, as in most text analytics problems.
  - Reason: uncorrelated attributes in the individual decision trees increase their bias, rendering the entire analysis highly error-ridden.
34. Implementation in R
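This slide's code is not included in the extract; below is a minimal sketch of what the implementation could look like, assuming the randomForest package and the public UCI download URL for the Adult data. The column names follow the UCI documentation, and fnlwgt is dropped to leave the 13 explanatory variables mentioned on the next slide.

# Hedged sketch: random forest on the UCI Adult data (assumed URL and columns).
library(randomForest)

url  <- "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
cols <- c("age", "workclass", "fnlwgt", "education", "education_num",
          "marital_status", "occupation", "relationship", "race", "sex",
          "capital_gain", "capital_loss", "hours_per_week", "native_country",
          "income")
adult <- read.csv(url, header = FALSE, col.names = cols, strip.white = TRUE,
                  na.strings = "?", stringsAsFactors = TRUE)
adult$fnlwgt <- NULL   # drop the survey weight so 13 explanatory variables remain

set.seed(123)
fit <- randomForest(income ~ ., data = adult, ntree = 500,
                    importance = TRUE, na.action = na.omit)

print(fit)        # the printout includes the OOB error estimate
importance(fit)   # per-variable importance measures
varImpPlot(fit)   # visual ranking of the influential factors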
35. Case Study Summary
Aim: understand the relative importance of different factors affecting the income level of an individual.
Data source: UCI Machine Learning Repository (Adult data)
- Approach
  - There are 13 explanatory variables included in the analysis.
  - The random forest technique is used to do a predictive analysis and to decide on the influential factors.
  - I will try to validate the generated model using standard validation techniques.
- Results
  - Important factors include:
    - Age
    - Relationship
    - Capital gain
    - Education
    - Occupation
    - Marital status
    - Hours per week worked
  - OOB error: 17.3%
36. Thank You
37. A1.1 Splitting criteria
38. A1.2 Out-of-Bag (O.O.B) Error
- For each tree, the observations left out of its bootstrap sample (the complement of its train set) become a test set.
- Validation is therefore done on the complement of the training set.
- The OOB estimate is the misclassification error averaged over all samples.
39. A1.3 Confusion Matrix
- Accuracy rate = (True Positives + True Negatives) / (Total number of instances)
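As a small illustration (my own example, using the built-in iris data), a confusion matrix and the accuracy rate can be computed in R as follows:

# Confusion matrix and accuracy for a random forest's predictions.
library(randomForest)

set.seed(7)
fit  <- randomForest(Species ~ ., data = iris)
pred <- predict(fit, iris)

cm <- table(Predicted = pred, Actual = iris$Species)
cm
sum(diag(cm)) / sum(cm)   # accuracy = correctly classified / total number of instances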
40. A1.4 ID3 Algorithm
- This is a top-down, recursive, divide-and-conquer algorithm
- Splitting continues until we reach definitive output labels
- At each level, all attributes are checked to find the best one
- Information gain is a popular criterion for the selection
41. Research Papers: Findings
- A paper, "An Empirical Comparison of Supervised Learning Algorithms", which builds on the results of the STATLOG project and was published by the computer science department of Cornell University, presents the following findings (Source):
  - A comparison is made among ten popular supervised machine learning algorithms, including decision trees, random forests, SVMs, boosted trees, neural networks and logistic regression.
  - Boosted trees and random forests come out as the best classifiers across all accuracy measures and all the datasets used (11 of them).
- Another paper, "Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?", published in the Journal of Machine Learning Research, compares 179 classifiers from 17 families over 121 datasets and comes up with the following finding (Source):
  - The classifier most likely to be the best is the random forest.
- The conclusion, in my mind, is that no one could ever empirically say that a particular algorithm is the best; it depends on the problem at hand. That being said, random forests have shown some real promise when it comes to classification problems.