Title: Regression
1. Regression
- Dr. János Abonyi
- University of Veszprem
- abonyij_at_fmt.vein.hu
- www.fmt.vein.hu/softcomp/dw
2. Increasing Potential to Support Business Decisions
[Pyramid figure: layers of decision support and the roles working at each level]
- Making Decisions (End User)
- Data Presentation, Visualization Techniques (Business Analyst)
- Data Mining, Information Discovery (Data Analyst)
- Data Exploration: Statistical Analysis, Querying and Reporting
- Data Warehouses / Data Marts, OLAP, MDA (DBA)
- Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
3. Linear Regression Models
f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j
Here the Xs might be
- Raw predictor variables (continuous or coded-categorical)
- Transformed predictors (e.g. X_4 = log X_3)
- Basis expansions (X_4 = X_3^2, X_5 = X_3^3, etc.)
- Interactions (X_4 = X_2 \cdot X_3)
A popular choice for estimation is least squares.
4. Least Squares
\hat{\beta} = (X^T X)^{-1} X^T y, with fitted values \hat{y} = X\hat{\beta} = H y,
where H = X (X^T X)^{-1} X^T is the "hat matrix".
We often assume that the Ys are independent and normally distributed, leading to various classical statistical tests and confidence intervals.
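As a minimal numerical sketch of the least-squares estimate and the hat matrix (data and variable names are illustrative, not from the slides):

```python
import numpy as np

# Toy data: n observations, p raw predictors plus an intercept column
rng = np.random.default_rng(0)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # design matrix with intercept
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Least-squares estimate: beta_hat = (X'X)^{-1} X'y  (solve instead of an explicit inverse)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Hat matrix H = X (X'X)^{-1} X' maps observed y to fitted values y_hat = H y
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y

print("beta_hat:", beta_hat)
print("trace(H) = number of fitted parameters:", H.trace())
```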
5. Evaluating the Model
- Variation measures
- Coefficient of determination
- Standard error of estimate
- Test coefficients for significance

Fitted regression line: \hat{Y}_i = b_0 + b_1 X_i
6. Variation Measures
[Figure: scatter of (X_i, Y_i) with the fitted line \hat{Y}_i = b_0 + b_1 X_i and the mean \bar{Y}, showing the SST, SSR and SSE deviations]
- Total sum of squares:       SST = \sum_i (Y_i - \bar{Y})^2
- Explained sum of squares:   SSR = \sum_i (\hat{Y}_i - \bar{Y})^2
- Unexplained sum of squares: SSE = \sum_i (Y_i - \hat{Y}_i)^2
7. Coefficient of Determination
- Proportion of variation explained by the relationship between X and Y:

r^2 = Explained variation / Total variation = SSR / SST,   with 0 \le r^2 \le 1

- Measures the ability of the equation to fit the data.
- Keep in mind that R^2 (and t-statistics) represent correlation, not causation.
8. Evaluating the Fit of a Regression Line
- Adjusted R^2
- R^2 will tend to be higher the fewer the data points one is trying to fit a regression line to.
- The fit of a regression line (measured by R^2) will always improve as more explanatory variables are added.
- Adjusted R^2 accounts for different sample sizes and different numbers of explanatory variables (see the formula below).
- N = number of observations, K = number of coefficients to be estimated (including the constant).
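With N and K defined as above, the adjusted R^2 takes the standard form

\bar{R}^2 = 1 - (1 - R^2) \frac{N - 1}{N - K}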
9. Tutorial
10. Too Many Predictors?
When there are lots of Xs, we get models with high variance, and prediction suffers. Three solutions:
- Subset selection
- Shrinkage / ridge regression
- Derived inputs
All-subsets, leaps-and-bounds, stepwise selection, AIC, BIC, etc.
11. Ridge Regression
\hat{\beta}^{ridge} = \arg\min_\beta \sum_i (y_i - \beta_0 - \sum_j x_{ij}\beta_j)^2   subject to   \sum_j \beta_j^2 \le t
Equivalently,
\hat{\beta}^{ridge} = \arg\min_\beta \{ \sum_i (y_i - \beta_0 - \sum_j x_{ij}\beta_j)^2 + \lambda \sum_j \beta_j^2 \}
This leads to \hat{\beta}^{ridge} = (X^T X + \lambda I)^{-1} X^T y. Choose \lambda by cross-validation.
12. Effective Number of Xs
The effective degrees of freedom of the ridge fit is df(\lambda) = tr[ X (X^T X + \lambda I)^{-1} X^T ], which shrinks from p toward 0 as \lambda grows.
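A small sketch of the ridge solution and its effective degrees of freedom (toy data; the \lambda value is chosen arbitrarily rather than by cross-validation):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 6
X = rng.normal(size=(n, p))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=n)

lam = 5.0                               # ridge penalty, normally chosen by cross-validation
I = np.eye(p)

# Ridge estimate: beta = (X'X + lambda*I)^{-1} X'y  (intercept omitted for simplicity)
beta_ridge = np.linalg.solve(X.T @ X + lam * I, X.T @ y)

# Effective degrees of freedom: df(lambda) = trace[ X (X'X + lambda*I)^{-1} X' ]
H_lam = X @ np.linalg.solve(X.T @ X + lam * I, X.T)
print("beta_ridge:", np.round(beta_ridge, 2))
print("df(lambda) =", H_lam.trace())    # between 0 and p, shrinks as lambda grows
```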
13. The Lasso
\hat{\beta}^{lasso} = \arg\min_\beta \sum_i (y_i - \beta_0 - \sum_j x_{ij}\beta_j)^2   subject to   \sum_j |\beta_j| \le t
A quadratic programming algorithm is needed to solve for the parameter estimates.
Viewing the constraint as \sum_j |\beta_j|^q \le t: q = 0 gives variable selection, q = 1 the lasso, q = 2 ridge. Learn q?
15. Dummy Variables
- Dummy variables and interaction terms
- Suppose you think men buy more pizzas than women, for any given level of advertising.
- You want different intercepts for women (a1) and men (a1 + a2).

[Figure: pizzas per month vs. advertising; parallel lines for men and women, with intercepts a1 (women) and a1 + a2 (men)]

If we add a male dummy, its coefficient represents the intercept differential:
Q = a1 + a2 (Male) + b (Advertising)
a1 is the intercept for women, since when Male = 0 the second term vanishes.
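A small sketch of fitting the dummy-variable model above by least squares (simulated pizza data; names and numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
adv = rng.uniform(0, 10, size=n)                 # advertising level
male = rng.integers(0, 2, size=n)                # dummy: 1 = male, 0 = female
# Simulated model: Q = a1 + a2*Male + b*Advertising + noise
Q = 3.0 + 2.0 * male + 0.8 * adv + rng.normal(scale=1.0, size=n)

# Design matrix: intercept, male dummy, advertising
X = np.column_stack([np.ones(n), male, adv])
a1, a2, b = np.linalg.lstsq(X, Q, rcond=None)[0]
print(f"intercept for women a1={a1:.2f}, differential a2={a2:.2f}, slope b={b:.2f}")
```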
16. Dummy Variables II
- Suppose you believe that in the absence of advertising men and women buy the same number of pizzas, but women respond more to advertising.

[Figure: pizzas per month vs. advertising; a common intercept a, with the women's line (slope b2) steeper than the men's line (slope b1)]

We want to interact advertising with the female dummy:
women: Q = a + b2 ADV
men:   Q = a + b1 ADV
all:   Q = a + b1 ADV + c (ADV \cdot Female)
c measures the difference in slope coefficients between females and males.
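Analogously, a sketch of the slope-differential model with an ADV × Female interaction term (simulated data, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
adv = rng.uniform(0, 10, size=n)
female = rng.integers(0, 2, size=n)
# Simulated model: common intercept, base slope b1, extra slope c for women
Q = 3.0 + 0.5 * adv + 0.4 * adv * female + rng.normal(scale=1.0, size=n)

X = np.column_stack([np.ones(n), adv, adv * female])   # intercept, ADV, ADV*Female
a, b1, c = np.linalg.lstsq(X, Q, rcond=None)[0]
print(f"a={a:.2f}, men's slope b1={b1:.2f}, slope differential c={c:.2f}")
```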
17. Variance of the Error Term Is Not Constant
- This violates one of the basic assumptions of regression analysis: that the residuals have constant variance.

[Figure: scatter of Y vs. X with the residual spread increasing in X (heteroscedasticity)]
18. What to Do?
- Heteroscedasticity could be caused by the wrong functional form, so moving to a nonlinear equation may help.

[Figure: two Y vs. X panels contrasting a linear fit with a nonlinear fit]
19. Polynomial Model Family
A polynomial model is a weighted sum of all monomials of the D inputs up to degree M, e.g.
f(x, w) = w_0 + \sum_i w_i x_i + \sum_{i,j} w_{ij} x_i x_j + \ldots
- Linear in w, so it reduces to the linear regression case, but with more variables.
- The number of terms grows as D^M.
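A brief sketch of fitting a one-dimensional polynomial model by least squares (degree and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(-2, 2, 60)
y = 1.0 - 2.0 * x + 0.5 * x**3 + rng.normal(scale=0.3, size=x.size)

M = 3                                             # polynomial degree
X = np.vander(x, M + 1, increasing=True)          # columns: 1, x, x^2, x^3
w = np.linalg.lstsq(X, y, rcond=None)[0]          # linear in w -> ordinary least squares
print("fitted weights w_0..w_3:", np.round(w, 2))
```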
20. Example: Polynomial Model
21. Generalized Linear Model
f(x, w) = \sum_k w_k h_k(x)
Linear in w, so it reduces to the linear regression case, but with more variables. Requires a good guess for the basis functions h_k(x).
22. Example: Generalized Linear Model
23. Basis Expansions for Linear Models
f(X) = \sum_{m=1}^{M} \beta_m h_m(X)
Here the h_m's might be
- h_m(X) = X_m, m = 1, ..., p  (recovers the original linear model)
- h_m(X) = X_j^2 or h_m(X) = X_j X_k
- h_m(X) = I(L_m \le X_k < U_m)  (piecewise constant)
24. Knots
25. Regression Splines
The bottom left panel uses a piecewise linear fit constrained to be continuous at the knots.
Number of parameters = (3 regions) \times (2 params per region) - (2 knots \times 1 constraint per knot) = 4
26. Cubic Spline
27. Cubic Spline
A cubic spline has continuous first and second derivatives at the knots.
Number of parameters = (3 regions) \times (4 params per region) - (2 knots \times 3 constraints per knot) = 6
The remaining discontinuity at the knots (in the third derivative) is essentially invisible to the human eye.
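A compact sketch of a cubic spline fit with two knots using the truncated power basis (knot positions and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0, 1, 80))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

knots = [0.33, 0.66]
# Truncated power basis for a cubic spline: 1, x, x^2, x^3, (x - k)_+^3 for each knot
# -> 4 + 2 = 6 parameters, matching the count on the slide
B = np.column_stack(
    [np.ones_like(x), x, x**2, x**3]
    + [np.clip(x - k, 0, None) ** 3 for k in knots]
)
beta = np.linalg.lstsq(B, y, rcond=None)[0]
y_hat = B @ beta
print("fitted spline coefficients:", np.round(beta, 2))
```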
28. Introduction to Artificial Neural Network Models
[Neuron image; source: ww.physiol.ucl.ac.uk/fedwards/ca120neuron.jpg]
29. Definition
Neural network: a broad class of models that mimic functioning inside the human brain.
- There are various classes of NN models. They differ from each other depending on
  - Problem type: prediction, classification, clustering
  - Structure of the model
  - Model-building algorithm
For this discussion we are going to focus on the feed-forward back-propagation neural network (used for prediction and classification problems).
30. A Bit of Biology...
The most important functional unit in the human brain is a class of cells called NEURONS.
[Image: hippocampal neurons; source: heart.cbl.utoronto.ca / berj/projects.html]
Schematic:
- Dendrites: receive information
- Cell body: processes information
- Axon: carries processed information to other neurons
- Synapse: junction between the axon end and the dendrites of other neurons
31. An Artificial Neuron
[Figure: inputs X1 ... Xp enter through weighted connections w1 ... wp ("dendrites"), are summed into the total input I in the cell body, and the output V = f(I) leaves along the axon]
I = w1 X1 + w2 X2 + w3 X3 + ... + wp Xp
- Receives inputs X1, X2, ..., Xp from other neurons or the environment
- Inputs are fed in through connections with weights
- Total input I = weighted sum of inputs from all sources
- A transfer function (activation function) converts the total input to the output V = f(I)
- The output goes to other neurons or the environment
32. Transfer Functions
There are various choices for the transfer / activation function:
- Logistic:  f(x) = e^x / (1 + e^x)   (output between 0 and 1)
- Threshold: f(x) = 0 if x < 0, 1 if x \ge 0
- Tanh:      f(x) = (e^x - e^{-x}) / (e^x + e^{-x})
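A minimal sketch of a single artificial neuron using these activation functions (weights and inputs are arbitrary):

```python
import numpy as np

def logistic(x):   return np.exp(x) / (1.0 + np.exp(x))
def threshold(x):  return np.where(x < 0, 0.0, 1.0)
def tanh_act(x):   return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def neuron(inputs, weights, transfer=logistic):
    """Total input I = sum(w_i * x_i); output V = f(I)."""
    I = np.dot(weights, inputs)
    return transfer(I)

x = np.array([0.5, -1.0, 2.0])          # inputs X1..X3
w = np.array([0.4, 0.1, -0.3])          # connection weights w1..w3
print(neuron(x, w, logistic), neuron(x, w, threshold), neuron(x, w, tanh_act))
```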
33. ANN: Feed-forward Network
A collection of neurons forms a layer.
- Input layer: each neuron gets ONLY one input, directly from outside
- Hidden layer: connects the input and output layers
- Output layer: the output of each neuron goes directly to the outside
34. ANN: Feed-forward Network
The number of hidden layers can be
- None
- One
- More
35. ANN: Feed-forward Network
A couple of things to note:
- Within a layer, neurons are NOT connected to each other.
- A neuron in one layer is connected to neurons ONLY in the NEXT layer (feed-forward).
- Jumping over a layer is NOT allowed.
36. One Particular ANN Model
What do we mean by "a particular model"?
Input: X1, X2, X3    Output: Y    Model: Y = f(X1, X2, X3)
For an ANN, the algebraic form of f(.) is too complicated to write down. However, it is characterized by
- Number of input neurons
- Number of hidden layers
- Number of neurons in each hidden layer
- Number of output neurons
- WEIGHTS for all the connections
Fitting an ANN model = specifying values for all those parameters.
37. One Particular Model: an Example
Model: Y = f(X1, X2, X3)    Input: X1, X2, X3    Output: Y
Example parameters: input neurons = 3, hidden layers = 1, hidden layer size = 3, output neurons = 3, weights specified.
38. Prediction Using a Particular ANN Model
Model: Y = f(X1, X2, X3)    Input: X1, X2, X3    Output: Y
[Figure: the network with its connection weights filled in (-0.2, 0.6, -0.1, 0.1, 0.7, 0.5, 0.1, -0.2); feeding an input vector forward gives the prediction 0.478]
Suppose the actual Y = 2. Then the prediction error = 2 - 0.478 = 1.522.
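A sketch of how such a prediction is computed by feeding the inputs forward through the network. The 3-2-1 architecture, logistic hidden units, linear output and the assignment of the eight weights are assumptions for illustration only, not taken from the slide:

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

# Assumed 3-2-1 network: 6 input-to-hidden weights + 2 hidden-to-output weights = 8 weights
W_hidden = np.array([[-0.2, 0.6, -0.1],       # weights into hidden neuron 1
                     [ 0.1, 0.7,  0.5]])      # weights into hidden neuron 2
W_output = np.array([0.1, -0.2])              # weights from hidden neurons to output

def predict(x):
    h = logistic(W_hidden @ x)                # hidden layer activations
    return float(W_output @ h)                # linear output neuron

x = np.array([1.0, 0.5, -0.3])                # an arbitrary input vector
y_hat = predict(x)
print("prediction:", y_hat, " error if actual Y = 2:", 2 - y_hat)
```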
39. Building an ANN Model
How do we build the model?
Input: X1, X2, X3    Output: Y    Model: Y = f(X1, X2, X3)
Number of input neurons = number of inputs = 3; number of output neurons = number of outputs = 1.
The architecture is now defined. How do we get the weights?
Given the architecture, there are 8 weights to decide: W = (W1, W2, ..., W8).
Training data: (Yi, X1i, X2i, ..., Xpi), i = 1, 2, ..., n. Given a particular choice of W, we get predicted Ys (V1, V2, ..., Vn); they are functions of W. Choose W such that the overall prediction error E is minimized:
E = \sum_i (Yi - Vi)^2
40. Training the Model
How do we train the model, i.e. choose the weights W to minimize E = \sum_i (Yi - Vi)^2?
41. Back Propagation
A bit more detail on back propagation:
Each weight "shares the blame" for the prediction error E = \sum_i (Yi - Vi)^2 with the other weights. The back-propagation algorithm decides how to distribute the blame among all the weights and adjusts the weights accordingly. A small portion of the blame leads to a small adjustment; a large portion of the blame leads to a large adjustment.
42. Weight Adjustment During Back Propagation
The weight adjustment formula in back propagation:
Vi, the prediction for the i-th observation, is a function of the network weight vector W = (W1, W2, ...). Hence E, the total prediction error, is also a function of W:
E(W) = \sum_i [Yi - Vi(W)]^2
Gradient descent method: for every individual weight Wi, the update formula looks like
W_{new} = W_{old} - \eta (\partial E / \partial W)|_{W_{old}}
\eta = learning parameter (between 0 and 1)
Another slight variation is also used sometimes:
W(t+1) = W(t) - \eta (\partial E / \partial W)|_{W(t)} + \alpha (W(t) - W(t-1))
\alpha = momentum (between 0 and 1)
43. Geometric Interpretation of the Weight Adjustment
Consider a very simple network with 2 inputs, 1 output and no hidden layer. There are only two weights whose values need to be specified:
E(w1, w2) = \sum_i [Yi - Vi(w1, w2)]^2
- A pair (w1, w2) is a point on a 2-D plane.
- For any such point we can get a value of E.
- Plotting E vs. (w1, w2) gives a 3-D surface: the "error surface".
- The aim is to identify the pair for which E is minimum,
- i.e. the pair for which the height of the error surface is minimum.
- Gradient descent algorithm:
  - Start with a random point (w1, w2).
  - Move to a better point (w1', w2') where the height of the error surface is lower.
  - Keep moving until you reach (w1*, w2*), where the error is minimum.
44. Crawling the Error Surface
45. Training Algorithm
- Decide the network architecture (number of hidden layers, neurons in each hidden layer).
- Decide the learning parameter and momentum.
- Initialize the network with random weights.
- Feed the i-th observation forward through the net.
- Compute the prediction error on the i-th observation.
- Back propagate the error and adjust the weights to reduce E = \sum_i (Yi - Vi)^2.
- Check for convergence; if not converged, continue with the next observation.
A code sketch of this loop follows below.
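A minimal sketch of the training loop for a single-hidden-layer feed-forward network with logistic hidden units, a linear output and stochastic gradient descent. The architecture, learning rate and data are illustrative assumptions; biases and the momentum term are omitted to keep it short:

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 3))                       # 3 inputs
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

n_hidden, eta = 4, 0.05                             # architecture and learning parameter
W1 = rng.normal(scale=0.5, size=(n_hidden, 3))      # input -> hidden weights
w2 = rng.normal(scale=0.5, size=n_hidden)           # hidden -> output weights (linear output)

for epoch in range(200):                            # training cycles
    E = 0.0
    for x, t in zip(X, y):
        h = logistic(W1 @ x)                        # feed forward: hidden activations
        v = w2 @ h                                  # network prediction V_i
        e = t - v                                   # prediction error Y_i - V_i
        E += e ** 2
        # Back propagate: gradient of (Y_i - V_i)^2 w.r.t. each weight
        grad_w2 = -2 * e * h
        grad_W1 = -2 * e * np.outer(w2 * h * (1 - h), x)
        w2 -= eta * grad_w2                         # gradient descent updates
        W1 -= eta * grad_W1
    if E / len(y) < 1e-3:                           # simple convergence check
        break
print("epochs run:", epoch + 1, " mean squared error:", E / len(y))
```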
46. Convergence Criterion
When do we stop training the network? Ideally, when we reach the global minimum of the error surface. But we don't know where that is, so how do we know we have reached it?
- Suggestions:
  - Stop if the decrease in total prediction error (since the last cycle) is small.
  - Stop if the overall changes in the weights (since the last cycle) are small.
Drawback: the training error keeps decreasing and we get a very good fit to the training data, BUT the network thus obtained has poor generalizing power on unseen data. This phenomenon is known as over-fitting of the training data: the network is said to "memorize" the training data, so that when an X from the training set is given, the network faithfully produces the corresponding Y, but for Xs the network didn't see before, it predicts poorly.
47. Convergence Criterion
Modified suggestion: partition the training data into a training set and a validation set.
- Training set: used to build the model.
- Validation set: used to test the performance of the model on unseen data.
Typically, as we run more and more training cycles, the error on the training set keeps decreasing, while the error on the validation set first decreases and then increases.
Stop training when the error on the validation set starts increasing.
48. Choice of Training Parameters
The learning parameter and momentum need to be supplied by the user; both should be between 0 and 1. What should their optimal values be? There is no clear consensus on any fixed strategy; however, the effects of wrongly specifying them are well studied.
Learning parameter:
- Too big: large leaps in weight space, with the risk of missing the global minimum.
- Too small: takes a long time to converge to the global minimum, and once stuck in a local minimum it is difficult to get out.
Suggestion: trial and error. Try various choices of the learning parameter and momentum and see which choice leads to the minimum prediction error.
49. Wrap Up
- Artificial neural network (ANN): a class of models inspired by biological neurons.
- Used for various modeling problems: prediction, classification, clustering, ...
- One particular subclass of ANNs: feed-forward back-propagation networks.
  - Organized in layers: input, hidden, output.
  - Each layer is a collection of artificial neurons.
  - Neurons in one layer are connected to neurons in the next layer; connections have weights.
- Fitting an ANN model means finding the values of these weights.
- Given a training data set, the weights are found by the feed-forward back-propagation algorithm, which is a form of the gradient descent method, a popular technique for function minimization.
- The network architecture as well as the training parameters are decided by trial and error: try various choices and pick the one that gives the lowest prediction error.
50. Instance-Based Learning
- Key idea: just store all training examples <xi, f(xi)>.
- Nearest neighbor:
  - Given a query instance x_q, first locate the nearest training example x_n, then estimate \hat{f}(x_q) = f(x_n).
- k-nearest neighbor:
  - Given x_q, take a vote among its k nearest neighbors (if the target function is discrete-valued).
  - Take the mean of the f values of the k nearest neighbors (if real-valued): \hat{f}(x_q) = \sum_{i=1}^{k} f(x_i) / k  (a sketch follows below).
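A short sketch of k-nearest-neighbor regression with Euclidean distance (k and the data are illustrative):

```python
import numpy as np

def knn_regress(X_train, y_train, x_query, k=3):
    """Predict f(x_query) as the mean of the k nearest training targets."""
    dists = np.linalg.norm(X_train - x_query, axis=1)    # Euclidean distances
    nearest = np.argsort(dists)[:k]                      # indices of k nearest neighbors
    return y_train[nearest].mean()

rng = np.random.default_rng(7)
X_train = rng.uniform(-3, 3, size=(100, 2))
y_train = np.sin(X_train[:, 0]) + 0.5 * X_train[:, 1]

print(knn_regress(X_train, y_train, np.array([0.5, -1.0]), k=5))
```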
51. Voronoi Diagram
[Figure: Voronoi diagram around the training points; labels: query point qf, nearest neighbor qi]
52. 3-Nearest Neighbors
[Figure: query point qf with its 3 nearest neighbors: 2 x, 1 o]
53. 7-Nearest Neighbors
[Figure: query point qf with its 7 nearest neighbors: 3 x, 4 o]
54. Nearest Neighbor (continuous): 1-nearest neighbor
55. Nearest Neighbor (continuous): 3-nearest neighbor
56. Nearest Neighbor (continuous): 5-nearest neighbor
57. When to Consider Nearest Neighbors
- Instances map to points in R^N
- Fewer than 20 attributes per instance
- Lots of training data
- Advantages
  - Training is very fast
  - Can learn complex target functions
  - Do not lose information
- Disadvantages
  - Slow at query time
  - Easily fooled by irrelevant attributes
58. Locally Weighted Regression
- Give more weight to neighbors closer to the query point.
- The kernel function is the function of distance that is used to determine the weight of each training example. In other words, the kernel function is the function K such that w_i = K(d(x_i, x_q)).
59. Kernel Functions
[Figure: shapes of several kernel functions K(d)]
60. Distance-Weighted k-NN
- Give more weight to neighbors closer to the query point:
  \hat{f}(x_q) = \sum_{i=1}^{k} w_i f(x_i) / \sum_{i=1}^{k} w_i
  where w_i = K(d(x_q, x_i)) and d(x_q, x_i) is the distance between x_q and x_i.
- Instead of only the k nearest neighbors, one can use all training examples (Shepard's method).
61. Distance-Weighted NN
K(d(x_q, x_i)) = 1 / d(x_q, x_i)^2
62. Distance-Weighted NN
K(d(x_q, x_i)) = 1 / (d_0 + d(x_q, x_i))^2
63. Distance-Weighted NN
K(d(x_q, x_i)) = exp( -(d(x_q, x_i) / \sigma_0)^2 )
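A sketch of Shepard's distance-weighted prediction using the kernels above (the d_0, sigma_0 values and the data are illustrative):

```python
import numpy as np

def inverse_square(d):            return 1.0 / d**2
def shifted_inverse(d, d0=0.1):   return 1.0 / (d0 + d)**2
def gaussian(d, sigma0=1.0):      return np.exp(-(d / sigma0)**2)

def shepard_predict(X_train, y_train, x_query, kernel=gaussian):
    """Distance-weighted average over ALL training points (Shepard's method)."""
    d = np.linalg.norm(X_train - x_query, axis=1)
    w = kernel(d)
    return np.sum(w * y_train) / np.sum(w)

rng = np.random.default_rng(8)
X_train = rng.uniform(-3, 3, size=(100, 2))
y_train = np.sin(X_train[:, 0]) + 0.5 * X_train[:, 1]
print(shepard_predict(X_train, y_train, np.array([0.5, -1.0])))
```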
64. Curse of Dimensionality
- Curse of dimensionality: the nearest neighbor is easily misled when the instance space is high-dimensional.
- One approach:
  - Stretch the j-th axis by weight z_j, where z_1, ..., z_n are chosen to minimize the prediction error.
  - Use cross-validation to automatically choose the weights z_1, ..., z_n.
  - Note: setting z_j to zero eliminates this dimension altogether (feature subset selection).
65. Distance-Weighted Average
- Weighting the data:
  \hat{f}(x_q) = \sum_i f(x_i) K(d(x_i, x_q)) / \sum_i K(d(x_i, x_q))
  The relevance of a data point (x_i, f(x_i)) is measured by the distance d(x_i, x_q) between the query x_q and the input vector x_i.
- Weighting the error criterion:
  E(x_q) = \sum_i (\hat{f}(x_q) - f(x_i))^2 K(d(x_i, x_q))
  The best estimate \hat{f}(x_q) minimizes the cost E(x_q), therefore \partial E(x_q) / \partial \hat{f}(x_q) = 0.
66. Locally Weighted Regression
- Local:
  - the function is approximated based only on data near the query point.
- Weighted:
  - the contribution of each training example is weighted by its distance from the query point.
- Regression:
  - approximating a real-valued function.
67. A Local Approximation
- Method 1: minimize the squared error over the k nearest neighbors:
  E_1(x_q) = \sum_{x \in kNN(x_q)} (f(x) - \hat{f}(x))^2
- Method 2: minimize the squared error over the entire set D, with distance weights:
  E_2(x_q) = \sum_{x \in D} (f(x) - \hat{f}(x))^2 K(d(x_q, x))
- Method 3: combine 1 and 2:
  E_3(x_q) = \sum_{x \in kNN(x_q)} (f(x) - \hat{f}(x))^2 K(d(x_q, x))
68. Local Linear Models
- Estimate the parameters \beta such that they locally (near the query point x_q) match the training data, either by
- weighting the data:
  - w_i = K(d(x_i, x_q))^{1/2}, and transforming
  - z_i = w_i x_i
  - v_i = w_i y_i
- or by weighting the error criterion:
  - E = \sum_{i=1}^{N} (x_i^T \beta - y_i)^2 K(d(x_i, x_q))
- which is still linear in \beta, with the LSQ solution
  - \beta = ((WX)^T WX)^{-1} (WX)^T W F(X)
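A sketch of a local linear model at a query point via weighted least squares (Gaussian kernel; the bandwidth and data are illustrative):

```python
import numpy as np

def local_linear_predict(X_train, y_train, x_query, h=0.5):
    """Fit a linear model weighted by K(d) = exp(-(d/h)^2) around x_query, then evaluate it there."""
    d = np.linalg.norm(X_train - x_query, axis=1)
    w = np.exp(-(d / h) ** 2)                                 # kernel weights
    Xa = np.column_stack([np.ones(len(X_train)), X_train])    # add intercept column
    W = np.sqrt(w)[:, None]
    # Weighted least squares: minimize sum_i w_i (x_i^T beta - y_i)^2
    beta, *_ = np.linalg.lstsq(W * Xa, np.sqrt(w) * y_train, rcond=None)
    return np.concatenate([[1.0], x_query]) @ beta

rng = np.random.default_rng(9)
X_train = rng.uniform(-3, 3, size=(200, 1))
y_train = np.sin(X_train[:, 0]) + rng.normal(scale=0.1, size=200)
print(local_linear_predict(X_train, y_train, np.array([1.0])))
```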
69. Design Issues in Local Regression
- Local model order (constant, linear, quadratic)
- Distance function d
  - feature scaling: d(x, q) = ( \sum_{j=1}^{d} m_j (x_j - q_j)^2 )^{1/2}
  - irrelevant dimensions: m_j = 0
- Kernel function K
- Smoothing parameter (bandwidth) h in K(d(x, q) / h)
  - h = m: global bandwidth
  - h = distance to the k-th nearest neighbor point
  - h = h(q): depending on the query point
  - h = h_i: depending on the stored data points
70. Remarks on Locally Weighted Regression
- In most cases, the target function is approximated by a constant, linear, or quadratic function.
- More complex functional forms are not used because
  - the cost of fitting more complex functions for each query instance is high, and
  - these simple approximations model the target function quite well over a sufficiently small subregion of the instance space.
71. RBF Networks
[Figure: an RBF network with d input nodes x_1 ... x_d; H hidden-layer radial basis functions y_j with spread constant \sigma; weights W_ji from inputs to hidden units and W_kj from hidden units (via net_k) to the c output nodes z_1 ... z_c, which use a linear activation function; indices i = 1, ..., d, j = 1, ..., H, k = 1, ..., c]
72. RBFN: Principle of Operation
- Using Gaussian radial basis functions
- Using sigmoidal radial basis functions
73. Radial Basis Function Network
\hat{f}(x) = w_0 + \sum_{u=1}^{k} w_u K_u(d(x_u, x))
- where x_u is an instance from X and K_u(d(x_u, x)) is a kernel function
- One common choice for K_u(d(x_u, x)) is the Gaussian: K_u(d(x_u, x)) = exp( -d(x_u, x)^2 / (2\sigma_u^2) )
74. Training Radial Basis Function Networks
- Q1: What x_u should we use for each kernel function K_u(d(x_u, x))?
  - Scatter them uniformly throughout the instance space
  - Or use training instances (reflects the instance distribution)
- Q2: How do we train the weights (assume here Gaussian K_u)?
  - First choose the variance (and perhaps the mean) for each K_u (e.g. use EM)
  - Then hold the K_u fixed and train the linear output layer
  - There are efficient methods to fit a linear function (see the sketch below)
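A minimal sketch of this two-stage training: pick centers from the training instances, fix Gaussian kernels, then fit the linear output layer by least squares (the number of centers and the spread are illustrative):

```python
import numpy as np

rng = np.random.default_rng(10)
X = rng.uniform(-3, 3, size=(200, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

H, sigma = 20, 1.0
centers = X[rng.choice(len(X), H, replace=False)]          # Q1: centers = training instances

def rbf_features(X, centers, sigma):
    # Gaussian kernels K_u(x) = exp(-||x - x_u||^2 / (2 sigma^2)), plus a bias column for w_0
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.column_stack([np.ones(len(X)), np.exp(-d2 / (2 * sigma**2))])

Phi = rbf_features(X, centers, sigma)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)                # Q2: train the linear output layer

x_new = np.array([[0.5, -1.0]])
print("prediction:", rbf_features(x_new, centers, sigma) @ w)
```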
75. Training Radial Basis Function Networks
- Training:
  - construct the kernel functions
  - adjust the weights
- RBF networks provide a global approximation to the target function, represented by a linear combination of many local kernel functions.
76. Local Linear Models
77. Linear Local Model: Example
78. Linear Local Model: Example
79. Tree-Based Methods
- Overview
  - Principle: divide and conquer
  - Variance is increased
  - Finesses the curse of dimensionality at the price of mis-specifying the model
  - Partition the feature space into a set of rectangles
  - For simplicity, use recursive binary partitions
  - Fit a simple model (e.g. a constant) in each rectangle
- Classification and Regression Trees (CART)
  - Regression trees
  - Classification trees
- Hierarchical Mixtures of Experts (HME)
80. CART
- An example (in the regression case)
81. Regression Trees
- Partition the space into M regions R_1, R_2, ..., R_M and fit a constant in each region (see the model below).
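In standard CART notation, the corresponding model and the per-region constants are

f(x) = \sum_{m=1}^{M} c_m I(x \in R_m),   with   \hat{c}_m = \mathrm{ave}(y_i \mid x_i \in R_m)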
82. How CART Sees an Elephant
It was six men of Indostan
To learning much inclined,
Who went to see the Elephant
(Though all of them were blind),
That each by observation
Might satisfy his mind.
-- "The Blind Men and the Elephant" by John Godfrey Saxe (1816-1887)
83. Regression Trees: Grow the Tree
- We want the partition that minimizes the sum of squared errors, but finding the global minimum is computationally infeasible.
- Greedy algorithm: at each level, choose the splitting variable j and split value s as in the criterion below.
- The greedy algorithm makes the tree unstable:
  - errors made at the upper levels are propagated to the lower levels.
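The standard CART splitting criterion referred to above, written out: with half-planes R_1(j, s) = {X | X_j \le s} and R_2(j, s) = {X | X_j > s}, choose

\min_{j,\,s} \Big[ \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2 \Big]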
84. Local Linear Model Tree (LOLIMOT)
- An incremental tree-construction algorithm that partitions the input space by axis-orthogonal splits and adds one local linear model (LLM) per iteration:
  1. Start with an initial model (e.g. a single LLM).
  2. Identify the LLM with the worst model error E_i.
  3. Check all divisions: split the worst LLM's hyper-rectangle in halves along each possible dimension.
  4. Find the best (smallest-error) of the possible divisions; add the new validity function and LLM.
  5. Repeat from step 2 until a termination criterion is met.
85. LOLIMOT
- Start with an initial global linear model.
- Split along x1 or x2.
- Pick the split that minimizes the model error (residual).
86. LOLIMOT Example
87. LOLIMOT Example
88. Regression Trees: How Large Should We Grow the Tree?
- Trade-off between accuracy and generalization:
  - a very large tree overfits,
  - a small tree might not capture the structure.
- Strategies:
  1. Split only when we can decrease the error (short-sighted; e.g. fails on XOR).
  2. Cost-complexity pruning (preferred).
89. Regression Trees: Pruning
- Cost-complexity pruning:
  - Pruning = collapsing some internal nodes.
  - Cost complexity: see the criterion below.
  - Choose the best alpha by weakest-link pruning:
    - each time, collapse the internal node that adds the smallest error,
    - then choose the best tree from this sequence by cross-validation.
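The cost-complexity criterion referred to above, in its standard form (N_m = number of points in leaf m, Q_m(T) = within-leaf squared error, |T| = number of leaves, \alpha = complexity parameter):

C_\alpha(T) = \sum_{m=1}^{|T|} N_m Q_m(T) + \alpha |T|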
90. Discussions on Trees
- Linear combination splits:
  - split the node based on a linear combination of the form \sum_j a_j X_j \le s
  - improves the predictive power
  - hurts interpretability