Fatal Errors - PowerPoint PPT Presentation

About This Presentation
Title:

Fatal Errors


Slides: 19
Provided by: edl9
Category:
Tags: errors | fatal | shave


Transcript and Presenter's Notes

Title: Fatal Errors


1
Data Preparation
  • Fatal Errors
  • Judgmental Adjustment
  • The Data Preparation Steps
  • Event Checking
  • Assumption Validation

2
Fatal Errors
These are errors that render the data USELESS.
They cannot be compensated for by selecting a
particular statistical modeling methodology. If
they cannot be corrected, no analysis is POSSIBLE.
  • Non-Representative Sample
  • Errors in the Data

3
How Far Back Do We Go?
This is the central question. Remember, we are
discussing regression methods. Most simply
stated, we want to project forward given the data
that we collected about the past. Therefore, we
want to go back far enough to have a reasonable
amount of data to develop a model, but not so far
back that the data is wholly irrelevant to the
future that we are projecting into. Needless to
say, this problem does not have a hard and fast
solution.
4
How Much and How Far Back
Generally, you need at least 15 data points.
After that it depends upon how far ahead you want
to forecast. A reasonable heuristic is to add 6
data points for each period ahead. This gives
reasonable control over the confidence intervals,
which tend to bow outward rather rapidly as one
projects into the future. So if you want to
project out 3 periods you probably need
15 plus 6 times 3, or 33, historical data points.
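The arithmetic of this rule of thumb can be sketched in a few lines. This is only an illustration of the slide's heuristic; the function name and parameter names are made up for the example.

```python
def min_history_points(periods_ahead, base=15, per_period=6):
    """Rule-of-thumb minimum history: 15 points plus 6 per forecast period.

    The name and the keyword parameters are illustrative; only the
    numbers 15 and 6 come from the slide.
    """
    if periods_ahead < 0:
        raise ValueError("periods_ahead must be non-negative")
    return base + per_period * periods_ahead


# Projecting 3 periods ahead, as in the slide's example.
print(min_history_points(3))  # 33
```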
5
Series Modification
This is a simple idea but a difficult topic to
explain by rules, since it is essentially a
heuristic. To get a handle on the idea and to see
how it works, let us look at the CandA downloads
and follow along with their analysis using some
of our series. We will look at the data
preparation steps a bit more closely when we
examine the regression model in a few weeks. CA1
and CA2
6
ANOVA
  • The basic idea here is that one has a few time
    series models with different measures of error.
    The question is whether the various models differ
    with respect to these error measures. The way
    that this gets answered is to look at the MCT for
    the one-way ANOVA. Let us take a look at this
    idea and then we will work an example. ANOVA
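As a rough sketch of the one-way ANOVA step, the F statistic for error measures grouped by model can be computed as below. The function name and the error values are hypothetical, and the MCT that follows the ANOVA (done in JMP in class) is not reproduced here.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    `groups` is a list of lists, one list of error measures per model.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    # Variation of the observations around their own group mean.
    ss_within = sum(
        (x - sum(g) / len(g)) ** 2 for g in groups for x in g
    )

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical absolute errors for three competing models.
errors_by_model = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df_b, df_w = one_way_anova_f(errors_by_model)
print(f_stat, df_b, df_w)
```

A large F relative to the F distribution with (df_between, df_within) degrees of freedom suggests the models differ on the error measure; a statistics package would supply the p-value.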

7
Flawed or Maverick Data
  • The idea here is to see if the data is recorded
    correctly and follows the modeling assumptions.
    This can be very time consuming and difficult to
    do. What works for me is the CC-Plot, the Box
    Plot and the Mahalanobis Plot. For the CC-Plot we
    look at the plot of the data to see if anything
    looks askew. The B-Plot screens variables in CC
    space while the M-Plot works in correlation
    space. They are very similar. A good heuristic is
    that if a point appears on two of the three
    plots, or if it is on both the B-Plot and the
    M-Plot, then think about removing it from the
    analysis.

8
The M-Plot
  • A note here regarding the M-Plot. It is a
    correlation space model. So when we have an
    economic regression, i.e. a Y and some Xs, we
    should use the M-Plot. For time series there is
    no M-Plot because we only have one series to work
    with; in this case we rely on the CC-Plot, the
    Box Plot and the feature identification
    information.

9
Here we simply look for data points that are far
from the norm. We can identify these points by
pointing the arrow at them, and the JMP program
indicates the point. This will then enable you to
go back into the data set and verify that the
data point is
Accurately Recorded
and
Representative
10
If the identified outlier is not accurately
recorded then change its value accordingly. If it
was accurately recorded but not representative of
a normal data point then take the average of it
and the two adjacent points on either side of the
point in question. If it is a representative
point, and sometimes it can be, then leave it as
it is; you will just have to live with it. The
result will be wide confidence intervals. Worse
things can happen!!!
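The averaging rule for an accurately recorded but non-representative point might be sketched as follows. The wording on the slide can be read as one neighbor on each side or two on each side, so the neighbor count `k` is left as a parameter; the function name, the default `k=1`, and the sample values are assumptions for illustration.

```python
def smooth_point(series, index, k=1):
    """Replace series[index] by the mean of itself and k neighbors per side.

    Returns a new list; the input is not modified. Near the ends of the
    series the window is simply truncated (an assumption, the slide does
    not cover that case).
    """
    lo = max(0, index - k)
    hi = min(len(series), index + k + 1)
    window = series[lo:hi]
    result = list(series)
    result[index] = sum(window) / len(window)
    return result


# Hypothetical series with a non-representative spike at position 2.
data = [10.0, 12.0, 50.0, 14.0, 16.0]
print(smooth_point(data, 2))       # one neighbor on each side
print(smooth_point(data, 2, k=2))  # two neighbors on each side
```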
11
The Event Check
Good analysis depends 100 percent on relevant
data. Now that we have screened the data we need
to ask if there have been events that have
rendered the data for some period of time
atypical. There can be MANY such events. One
classic example is the WTC attack. Most analysts
look carefully at the time period just after the
attack; this is called the 911 check. We do this
by asking whether the mean of the data is
statistically different between the two periods,
911 and non-911. Let me show you how this works
in JMP using series 2. Assume that the first 13
points were the production values during the
period before 9-11 and the rest at and after 9-11.
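Outside JMP, the comparison of the two period means could be sketched with a two-sample t statistic. The slide does not say which test JMP is asked to run, so Welch's unequal-variance form is an assumption here, as are the function names and the sample values; only the split after the first 13 points comes from the slide.

```python
import math
import statistics


def split_at_event(series, n_before=13):
    """Split a series into the points before the event and the rest."""
    return series[:n_before], series[n_before:]


def welch_t(before, after):
    """Return (t, df) for Welch's two-sample t test on the means."""
    n1, n2 = len(before), len(after)
    m1, m2 = statistics.mean(before), statistics.mean(after)
    v1, v2 = statistics.variance(before), statistics.variance(after)

    se2 = v1 / n1 + v2 / n2  # squared standard error of the difference
    t = (m1 - m2) / math.sqrt(se2)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df


# Hypothetical production values: 13 points before the event, 7 after.
series = [20.0, 21.0, 19.0, 22.0, 20.0, 21.0, 23.0, 20.0, 19.0, 22.0,
          21.0, 20.0, 22.0, 15.0, 16.0, 14.0, 17.0, 15.0, 16.0, 18.0]
before, after = split_at_event(series)
t, df = welch_t(before, after)
print(t, df)
```

The statistic would then be compared against a t distribution with `df` degrees of freedom (from a table or a statistics package) to decide whether the two periods differ.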
12
Residual Analysis
  • This is the last analysis step. Really this is a
    modeling check, but I like to discuss it as part
    of the data preparation. Simply put, it is the
    check on the randomness of the residuals. After
    we fit a model we ask if the residuals are
    random. That is the assumption under which we fit
    the model, as you have learned from the
    regression part of your stat course. The check
    that we will use is the Fisher's Kappa test. It
    is a spectral check on all the lags in the
    residual series. The Durbin-Watson test checks
    only the first-order lag; it is weak and I never
    use it. If the p-value from the Fk test is:
  • P-value > 0.10: no structure problems,
  • P-value in the range 0.01 to 0.10: accept the
    model with reservation,
  • P-value < 0.01: the model fails the test.
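The three-way decision rule above can be written down directly. This sketch only classifies a p-value that has already been obtained (for example from JMP); it does not compute Fisher's Kappa. The function name is made up, and treating exactly 0.10 and exactly 0.01 as "with reservation" is an assumption, since the slide does not say which side the boundaries fall on.

```python
def fk_verdict(p_value):
    """Classify a Fisher's Kappa p-value using the slide's thresholds."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    if p_value > 0.10:
        return "no structure problems"
    if p_value >= 0.01:
        # Boundary values 0.10 and 0.01 land here by assumption.
        return "accept the model with reservation"
    return "the model fails the test"


for p in (0.35, 0.05, 0.004):
    print(p, "->", fk_verdict(p))
```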

13
Judgmental Adjustment Summary
  • The CC, B and M-Plots help screen the data.
  • The Event Check helps us to work with relevant
    data and is just another way to think about
    feature identification. We will see that feature
    identification pertains to both the Economic
    Regression and Time Series modeling.
  • There is one more aspect that is very often used,
    called Winsorizing. Let us consider it next.

14
Winsorizing
  • There are two ways that Winsorizing is used. We
    will look at one aspect now and consider the
    other more carefully when we get to the Error
    Measures. When we have lots of relevant data
    then we use Winsorizing to shave off the extreme
    values. The only time I find that I use it is in
    doing analyses of daily stock returns where I
    have more than 500 observations. In that case I
    always use Winsorizing as follows.

15
Winsorizing I
Winsorizing is also called Bandwidth Filtering
and Windowing. It is intended to knock out the
extreme data values to really control the
variance of the data, since regression is
parametric and outliers pump up the variance. We
simply pass the data through a symmetric filter
that shaves off extreme values. Simple as this
sounds, it does matter how you set the window. I
use a heuristic to do this, which is presented in
the next slide. A point of reference for this
heuristic: as I said, I only use Winsorizing when
I have daily data. For monthly, quarterly and
yearly data I rely on the data preparation steps
to screen outliers.
16
My heuristic steps
  • 1.)   Compute the standard deviation of the total
    data set; here I am really talking about daily
    data sets, i.e. large ones.
  • 2.)   Take 2.25 times the Sd of the company
    (remember, I am here talking about
    security/return data), round it up to something
    reasonable and use that as the first Window cut.
  • 3.)   If you screen somewhere in the range of
    2% to 5% of the data with the window and the
    resultant distribution looks sort of balanced
    (symmetric), then use that Window. Remember, we
    still are going to use the M-Plot to further
    screen the data.
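The heuristic steps can be sketched as below, under several stated assumptions: the window is centered on the sample mean, extreme values are clipped to the window edges (the usual meaning of Winsorizing, although the slides' "shave off" could also mean dropping them), and the screened share is read as 2% to 5%. The judgmental rounding of the cut and the visual symmetry check are not automated. All names and the sample data are hypothetical.

```python
import statistics


def winsorize_by_sd(returns, multiplier=2.25, center=None):
    """Clip values to center +/- multiplier * Sd.

    Returns (clipped_values, fraction_clipped, (low, high)).
    Centering on the mean and clipping rather than dropping are
    assumptions made for this sketch.
    """
    if center is None:
        center = statistics.mean(returns)
    half_width = multiplier * statistics.stdev(returns)
    low, high = center - half_width, center + half_width

    clipped = [min(max(r, low), high) for r in returns]
    n_clipped = sum(1 for r in returns if r < low or r > high)
    return clipped, n_clipped / len(returns), (low, high)


def in_target_range(fraction, lower=0.02, upper=0.05):
    """True if the window screens roughly 2% to 5% of the data."""
    return lower <= fraction <= upper


# Hypothetical daily returns: mostly quiet, with two extreme days.
daily_returns = [0.0] * 98 + [10.0, -10.0]
clipped, fraction, window = winsorize_by_sd(daily_returns)
print(window, fraction, in_target_range(fraction))
```

If the fraction falls outside the target range, the multiplier (or the rounded cut) would be adjusted by hand and the check repeated.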

17
Winsorizing II
  • When I have small data sets I never use the
    Winsorizing that we just examined. However,
    Winsorizing is also used to bound the error
    computations. To make the analysis reasonable we
    bound the error measures so that they are all in
    the interval 0.01 to 10.0. We will look at this
    computation next week. It is very simple.
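The bounding step on its own amounts to a clamp. The slide defers the full error computation to a later session, so this sketch shows only the assumed clamping of an already computed error measure into the interval 0.01 to 10.0; the function name and sample values are illustrative.

```python
def bound_error(value, low=0.01, high=10.0):
    """Clamp an error measure into the interval [low, high]."""
    return min(max(value, low), high)


# Hypothetical error measures, two of them outside the interval.
raw_errors = [0.001, 0.5, 1.5, 25.0]
print([bound_error(e) for e in raw_errors])  # [0.01, 0.5, 1.5, 10.0]
```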

18
Summary
The CC-Plot, the Box-Plot, the M-Plot, the Event
Check, and the Residuals Check are INDISPENSABLE.
If you do not do them, do not do the analysis.
Winsorizing is very useful for large data sets. I
only use it for daily data. OK, we have covered a
lot of material. Let us now do a data preparation
work-up of series 2. For a take-home group quiz I
want you to do the same for series 15 and the
Construction data set. Due in one week. One
page. Series 2