Title: Fatal Errors
1 Data Preparation
- Fatal Errors
- Judgmental Adjustment
- The Data Preparation Steps
- Event Checking
- Assumption Validation
2 Fatal Errors
These are errors that render the data USELESS. They cannot be compensated for by the selection of a particular statistical modeling methodology. If they cannot be corrected, no analysis is POSSIBLE.
- Non-Representative Sample
- Errors in the Data
3 How Far Back Do We Go?
This is the central question. Remember, we are discussing regression methods. Most simply stated, we want to project forward given the data that we collected about the past. Therefore, we want to go back far enough to have a reasonable amount of data to develop a model, but not so far back that the data is so far removed from the future we are projecting into that it is wholly irrelevant. Needless to say, this problem does not have a hard and fast solution.
4 How Much and How Far Back
Generally, you need at least 15 data points. Beyond that, it depends upon how far ahead you want to forecast. A reasonable heuristic is to add 6 data points for each period ahead. This gives reasonable control over the confidence intervals, which tend to bow outward rather rapidly as one projects into the future. So if you want to project out 3 periods, you probably need 15 plus 6 times 3, or 33, historical data points.
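A quick sketch of this heuristic in Python (the 15-point base and 6 points per period come from the slide; the function name is just for illustration):

    def min_history_points(horizon, base=15, per_period=6):
        # Heuristic: at least 15 points, plus 6 for each period to be forecast.
        return base + per_period * horizon

    print(min_history_points(3))  # 15 + 6*3 = 33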
5 Series Modification
This is a simple idea but also a difficult topic to explain by rules, since it is essentially a heuristic. To get a handle on the idea and to see how it works, let us look at the CandA downloads and follow along with their analysis using some of our series. We will look at the data preparation steps a bit more closely when we examine the regression model in a few weeks. CA1 and CA2
6 ANOVA
- The basic idea here is that one has a few time series with different measures of error. The question is whether the various models are different with respect to these error measures. The way this gets answered is to look at the MCT for the one-way ANOVA. Let us take a look at this idea and then we will work an example (a rough sketch follows below). ANOVA
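As an illustration of the idea (not the JMP workflow), here is a one-way ANOVA on hypothetical error measures for three models using scipy; the numbers are made up for the example:

    from scipy import stats

    # Hypothetical error measures (e.g., absolute percentage errors) for three models
    model_a = [4.2, 3.9, 5.1, 4.4, 4.8]
    model_b = [6.0, 5.7, 6.3, 5.9, 6.1]
    model_c = [4.5, 4.1, 4.9, 4.6, 4.3]

    f_stat, p_value = stats.f_oneway(model_a, model_b, model_c)
    print(f"F = {f_stat:.2f}, p = {p_value:.4f}")

A small p-value says at least one model's mean error differs; a multiple comparison test (MCT) then shows which pairs differ.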
7 Flawed or Maverick Data
- The idea here is to see whether the data is recorded correctly and follows the modeling assumptions. This can be very time-consuming and difficult to do. What works for me is the CC-Plot, the Box Plot and the Mahalanobis Plot. For the CC-Plot we look at the plot of the data to see if anything looks askew. The B-Plot screens variables in CC space while the M-Plot works in correlation space. They are very similar. A good heuristic is that if a point appears on two of the three, or if it is on both the B-Plot and the M-Plot, then think about removing it from the analysis (a sketch of the Mahalanobis screen follows below).
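Outside JMP, the M-Plot idea can be approximated by computing Mahalanobis distances and flagging points beyond a chi-square cutoff. A minimal sketch, with the cutoff level chosen only for illustration:

    import numpy as np
    from scipy.stats import chi2

    def mahalanobis_flags(X, alpha=0.025):
        # X is an n-by-p array (Y and the Xs as columns).
        X = np.asarray(X, dtype=float)
        mu = X.mean(axis=0)
        S_inv = np.linalg.inv(np.cov(X, rowvar=False))
        diff = X - mu
        d2 = np.einsum("ij,jk,ik->i", diff, S_inv, diff)   # squared Mahalanobis distances
        cutoff = chi2.ppf(1 - alpha, df=X.shape[1])        # chi-square(p) quantile
        return d2, d2 > cutoff

Points whose flag is True are the candidates to check against the CC-Plot and B-Plot before deciding whether to remove them.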
8 The M-Plot
- A note here regarding the M-Plot. It works in correlation space. So when we have an economic regression, i.e. a Y and some Xs, we should use the M-Plot. For a time series there is no M-Plot, because we only have one series to work with; in that case we rely on the CC-Plot, the Box Plot and the feature identification information.
9 Here we simply look for data points that are far from the norm. We can identify these points by pointing the arrow at them, and the JMP program indicates the point. This then enables you to go back into the data set and verify whether the data point is
Accurately Recorded
and
Representative
10 If the identified outlier is not accurately recorded, then change its value accordingly. If it was accurately recorded but not representative of a normal data point, then take the average of it and the two adjacent points on either side of the point in question (a sketch follows below). If it is a representative point, and sometimes it can be, then leave it as it is; you will just have to live with it. The result will be wide confidence intervals. Worse things can happen!!!
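A minimal sketch of that adjustment, reading "the average of it and the two adjacent points" as the point plus one neighbor on each side; that reading, and the made-up series, are my assumptions:

    def smooth_point(series, i):
        # Replace a non-representative point with the average of itself
        # and its immediate neighbors (uses fewer points at the ends).
        lo, hi = max(i - 1, 0), min(i + 1, len(series) - 1)
        window = series[lo:hi + 1]
        return sum(window) / len(window)

    data = [12, 11, 13, 40, 12, 11]   # suspect point at index 3
    data[3] = smooth_point(data, 3)   # (13 + 40 + 12) / 3, about 21.7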
11 The Event Check
Good analysis depends 100 percent on relevant data. Now that we have screened the data, we need to ask whether there have been events that rendered the data for some period of time atypical. There can be MANY such events. One classic is the WTC attack. Most analysts look carefully at the time period just after the attack, the so-called 9-11 check. We do this by checking that the mean of the data is not statistically different between the two periods, 9-11 and non-9-11. Let me show you how this works in JMP using series 2. Assume that the first 13 points were the production values during the period before 9-11 and the rest at and after 9-11 (a sketch follows below).
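Outside JMP, the same check can be sketched as a two-sample t-test on the pre- and post-event segments. The split at observation 13 follows the slide; the data here are random placeholders, not series 2:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    series2 = rng.normal(100, 5, size=30)    # placeholder for the production series
    pre, post = series2[:13], series2[13:]   # first 13 points = before 9-11

    t_stat, p_value = stats.ttest_ind(pre, post, equal_var=False)
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

A large p-value means the two means are not statistically different, so the event did not render part of the series atypical.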
12 Residual Analysis
- This is the last analysis step. Really it is a modeling check, but I like to discuss it as part of the data preparation. Simply, it is the check on the randomness of the residuals. After we fit a model, we ask whether the residuals are random. That is the assumption under which we fit the model, as you have learned from the regression part of your stat course. The check that we will use is the Fisher's Kappa test. It is a spectral check on all the lags in the residual series. The first-order lag is what the Durbin-Watson test checks; that test is weak and I never use it. For the p-value from the Fisher's Kappa test:
- p-value > 0.10: no structure problems
- p-value in the range (0.10 to 0.01): accept the model with reservation
- p-value < 0.01: the model fails the test
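JMP reports Fisher's Kappa directly. As an illustration of what it computes (not the JMP implementation), here is one way to get the kappa statistic and Fisher's exact p-value from the periodogram of the residuals:

    import numpy as np
    from scipy.signal import periodogram
    from scipy.special import comb

    def fishers_kappa(residuals):
        # Periodogram ordinates of the residual series, dropping the zero frequency
        # (for even n the Nyquist ordinate is sometimes dropped too; kept here for simplicity).
        _, pgram = periodogram(np.asarray(residuals, dtype=float))
        I = pgram[1:]
        m = len(I)
        g = I.max() / I.sum()      # Fisher's g statistic
        kappa = m * g              # kappa = largest ordinate / mean ordinate
        # Exact p-value for Fisher's test, P(G > g)
        j = np.arange(1, int(np.floor(1 / g)) + 1)
        p = np.sum((-1.0) ** (j - 1) * comb(m, j) * (1 - j * g) ** (m - 1))
        return kappa, float(min(max(p, 0.0), 1.0))

A p-value above 0.10 from this test would indicate no remaining periodic structure in the residuals, matching the cutoffs above.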
13 Judgmental Adjustment Summary
- The CC, B and M-Plots help screen the data.
- The Event Check helps us work with relevant data and is just another way to think about feature identification. We will see that feature identification pertains to both Economic Regression and Time Series modeling.
- There is one more aspect that is very often used, called Winsorizing. Let us consider it next.
14 Winsorizing
- There are two ways that Winsorizing is used. We will look at one aspect now and consider the other more carefully when we get to the Error Measures. When we have lots of relevant data, we use Winsorizing to shave off the extreme values. The only time I find that I use it is in analyses of daily stock returns, where I have more than 500 observations. In that case I always use Winsorizing as follows.
15 Winsorizing I
Winsorizing is also called Bandwidth Filtering and Windowing. It is intended to knock out the extreme data values and so control the variance of the data, since regression is parametric and outliers pump up the variance. We simply pass the data through a symmetric filter that shaves off extreme values. Simple as this sounds, it does matter how you set the window. I use a heuristic to do this, which is presented on the next slide. A point of reference for this heuristic: as I said, I only use Winsorizing when I have daily data. For monthly, quarterly and yearly data I rely on the data preparation steps to screen outliers.
16 My Heuristic Steps
- 1.) Compute the standard deviation of the total data set (here I am really talking about daily data sets, i.e. large ones).
- 2.) Take 2.25 times the SD of the company (remember, I am here talking about security/return data), round it up to something reasonable, and use that as the first window cut.
- 3.) If you screen somewhere in the range of 2 to 5 percent of the data with the window, and the resultant distribution looks sort of balanced/symmetric, then use that window (a sketch follows below). Remember, we are still going to use the M-Plot to further screen the data.
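A sketch of steps 1 through 3 in numpy. The 2.25-SD symmetric window comes from the slide; centering it on the mean, clipping rather than dropping the flagged values, and the stand-in return series are my assumptions:

    import numpy as np

    def winsorize_window(returns, k=2.25):
        r = np.asarray(returns, dtype=float)
        cut = k * r.std(ddof=1)                    # step 2: 2.25 times the SD
        lower, upper = r.mean() - cut, r.mean() + cut
        outside = (r < lower) | (r > upper)
        pct_screened = 100 * outside.mean()        # step 3: aim for roughly 2 to 5 percent
        clipped = np.clip(r, lower, upper)         # shave extremes back to the window edges
        return clipped, pct_screened

    rng = np.random.default_rng(1)
    daily = rng.standard_t(df=3, size=600) * 0.01  # heavy-tailed stand-in for daily returns
    _, pct = winsorize_window(daily)
    print(f"{pct:.1f} percent of points fall outside the window")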
17 Winsorizing II
- When I have small data sets I never use Winsorizing as we just examined it. However, Winsorizing is also used to bound the error computations. To make the analysis reasonable, we bound the error measures so that they are all in the interval 0.01 to 10.0. We will look at this computation next week. It is very simple.
18 Summary
The CC-Plot, the Box Plot, the M-Plot, the Event Check, and the Residuals Check are INDISPENSABLE. If you do not do them, do not do the analysis. Winsorizing is very useful for large data sets; I only use it for daily data. OK, we have covered a lot of material. Let us now do a data preparation work-up of series 2. For a take-home group quiz I want you to do the same for series 15 and the Construction data set. Due in one week. One page. Series 2