1
These are general notional tutorial slides on
data mining theory and practice from which
content may be freely drawn.
  • Monte F. Hancock, Jr.
  • Chief Scientist
  • Celestech, Inc.

2
Data Mining is the detection, characterization,
and exploitation of actionable patterns in data.
3
Data Mining (DM)
  • Data Mining (DM) is the principled detection,
    characterization, and exploitation of actionable
    patterns in data.
  • It is performed by applying modern mathematical
    techniques to collected data in accordance with
    the scientific method.
  • DM uses a combination of empirical and
    theoretical principles to Connect Structure to
    Meaning by
  • Selecting and conditioning relevant data
  • Identifying, characterizing, and classifying
    latent patterns
  • Presenting useful representations and
    interpretations to users
  • DM attempts to answer these questions
  • What patterns are in the information?
  • What are the characteristics of these patterns?
  • Can meaning be ascribed to these patterns
    and/or their changes?
  • Can these patterns be presented to users in a way
    that will facilitate their assessment,
    understanding, and exploitation?
  • Can a machine learn these patterns and their
    relevant interpretations?

4
DM for Decision Support
  • Decision Support is all about
  • enabling users to group information in familiar
    ways
  • controlling complexity by layering results (e.g.,
    drill-down)
  • supporting users' changing priorities
  • allowing intuition to be triggered ("I've seen this before!")
  • preserving and automating perishable institutional knowledge
  • providing objective, repeatable metrics (e.g., confidence factors)
  • fusing and simplifying results
  • automating alerts on important results ("It's happening again!")
  • detecting emerging behaviors before they
    consummate (Look!)
  • delivering value (timely-relevant-accurate
    results)
  • helping users make the best choices.

5
DM Provides Intelligent Analytic Functions
  • Automating pattern detection to characterize
    complex, distributed signatures that are worth
    human attention and recognize those that are
    not.
  • Associating events that go together but are
    difficult for humans to correlate.
  • Characterizing interesting processes, not just facts or simple events
  • Detecting actionable anomalies and explaining what makes them different AND interesting.
  • Describing contexts from multiple perspectives with numbers, text, and graphics

6
DM Answers Questions Users are Asking
  • Fusion Level 1: Who/What is Where/When in my space?
  • Organize and present facts in domain context
  • Fusion Level 2: What does it mean?
  • Has this been seen before? What will happen next?
  • Fusion Level 3: Do I care?
  • Enterprise relevance? What action should be taken?
  • Fusion Level 4: What can I do better next time?
  • Adaptation by pattern updates and retraining
  • How certain am I?
  • Quantitative assessment of evidentiary pedigree

7
Useful Data Applications
  • Accurate identification and classification - add value to raw data by tagging and annotation (e.g., fraud detection)
  • Anomaly/normalcy and fusion - characterize, quantify, and assess normalcy of patterns and trends (e.g., network intrusion detection)
  • Emerging patterns and evidence evaluation - capturing institutional knowledge of how events arise and alerting when they emerge
  • Behavior association - detection of actions that are distributed in time/space but synchronized by a common objective ("connecting the dots")
  • Signature detection and association - detection/characterization of multivariate signals, symbols, and emissions (e.g., voice recognition)
  • Concept tagging - reasoning about abstract relationships to tag and annotate media of all types (e.g., automated web bots)
  • Software agents assisting analysts - small-footprint "fire-and-forget" apps that facilitate search, collaboration, etc.

8
Some Good Data Mining Analytic
Applications
  • Help the user focus via unobtrusive automation
  • Off-load burdensome labor (perform intelligent
    searches, smart winnowing)
  • Post smart triggers/tripwires to data stream
    (e.g., anomaly detection)
  • Help with mission triage (Sort my in-basket!)
  • Automate aspects of classification and detection
  • Determine which sets of data hold the most
    information for a task
  • Support construction of ad hoc on-the-fly
    classifiers
  • Provide automated constructs for merging decision
    engines (multi-level fusion)
  • Detect and characterize domain drift (the
    rules of the game are changing)
  • Provide functionality to make best estimate of
    missing data
  • Extract/characterize/employ knowledge
  • Rule induction from data, develop signatures
    from data
  • Implement reasoning for decision support
  • High-dimensional visualization
  • Embed decision explanation capability into
    analytic applications
  • Capture/automate/institutionalize best practice
  • Make proven analytic processes available to all
  • Capture rare, perishable human knowledge and put
    it everywhere
  • Generate signature-ready prose reports

9
Things that make hard problems VERY hard
  • Events of interest occur relatively infrequently
    in very large datasets (population imbalance)
  • Information is distributed in a complex way
    across many features (the feature selection
    problem)
  • Collection is hard to task, data are difficult to
    prepare for analysis, and are never perfect
    (noise in the data, data gaps, coverage gaps)
  • Target patterns are ambiguous/unknown; "squelch" settings are brittle (e.g., hard to balance detection vs. false-alarm rates)
  • Target patterns change/morph over time and across operational modes (domain drift; processing methods become stale)

10
Some Key Principles of Information Driven Data
Mining
  1. Right People, Methods, Tools (in that order)
  2. Make no prior assumptions about the problem
    (agnostic)
  3. Begin with general techniques that let the data
    determine the direction of the analysis (Funnel
    Method)
  4. Don't jump to conclusions; perform process audits as needed
  5. Don't be a "one-widget wonder"; integrate multiple paradigms so the strengths of one compensate for the weaknesses of another
  6. Break the problem into the right pieces (Divide and Conquer)
  7. Work the data, not the tools, but automate when possible
  8. Be systematic, consistent, and thorough; don't lose the forest for the trees.
  9. Document the work so that it is reproducible
  10. Collaborate to avoid surprises: team members, experts, customer
  11. Focus on the Goal: maximum value to the user within cost and schedule

11
Select Appropriate Machine Reasoners
  • 1.) Classifiers
  • Classifiers ingest a list of attributes, and
    determine into which of finitely many categories
    the entity exhibiting these attributes falls.
    Automatic object recognition and next-event
    prediction are examples of this type of
    reasoning.
  • 2.) Estimators
  • Estimators ingest a list of attributes, and
    assign some numeric value to the entity
    exhibiting these attributes. The estimation of a
    probability or a "risk score" are examples of
    this type of reasoning.
  • 3.) Semantic Mappers
  • Semantic mappers ingest text (structured,
    unstructured, or both), and generate a data
    structure that gives the "meaning" of the text.
    Automatic gisting of documents is an example of this type of reasoning. Semantic mapping generally requires some kind of domain model.
  • 4.) Planners
  • Planners ingest a scenario description, and
    formulate an efficient sequence of feasible
    actions that will move the domain to the
    specified goal state.
  • 5.) Associators
  • Associators sample the entire corpus of
    domain data, and identify relationships among
    entities. Automatic clustering of data to
    identify coherent subpopulations is a simple
    example. A more sophisticated example is the
    forensic analysis of phone, flight, and financial
    records to infer the structure of terrorist
    networks.

12
Overcoming Processing Challenges through
Intelligent Automation of Data Conditioning,
Feature Selection, and Source Conformation
  • Data Quality
  • Cleanliness, Consistency
  • Comprehensiveness
  • Completeness
  • Correctness
  • Information Quality
  • Representative (ground truth)
  • Timeliness
  • Salience
  • Independence
  • Attributes of Enterprise Problems
  • New trends
  • New behavior/event schemes
  • Non-stationarity
  • Population imbalance
  • Inability to act on findings

13
Embedded Knowledge
  • Principled, domain-savvy synthesis of
    circumstantial evidence
  • Copes well with ambiguous, incomplete, or
    incorrect input
  • Enables justification of results in terms domain
    experts use
  • Facilitates good pedagogical helps
  • Solves the problem the way the human expert does, and so is comprehensible to most domain experts.
  • Degrades linearly in combinatorial domains
  • Can grow in power with experience
  • Preserves perishable expertise
  • Allows efficient incremental upgrade/adjustment/repurposing

14
Features
  • A feature is the value assumed by some attribute
    of an entity in the domain
  • (e.g., size, quality, age, color, etc.)
  • Features can be numbers, symbols, or complex data
    objects
  • Features are usually reduced to some simple form
    before modeling is performed.
  • >>> Features are usually single numeric values or contiguous strings. <<<

15
Feature Space
  • Once the features have been designated, a feature
    space can be defined for a domain by placing the
    features into an ordered array in a systematic
    way.
  • Each instance of an entity having the given features is then represented by a single point in n-dimensional Euclidean space: its feature vector.
  • This Euclidean space, or feature space for the
    domain, has dimension equal to the number of
    features.
  • Feature spaces can be one-dimensional,
    infinite-dimensional, or anywhere in between.
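A minimal sketch of the idea in Python (the attribute names and values below are hypothetical):

  # Sketch: pack an entity's attribute values into an ordered array -- one point in feature space.
  FEATURE_ORDER = ["size", "quality", "age"]          # hypothetical features, fixed ordering

  def to_feature_vector(entity):
      """Return the entity's feature vector as an ordered list of numbers."""
      return [float(entity[name]) for name in FEATURE_ORDER]

  print(to_feature_vector({"size": 12.0, "quality": 0.8, "age": 3}))   # [12.0, 0.8, 3.0]

The dimension of the resulting feature space equals the number of features in the ordering.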

16
How do classifiers work?
17
(No Transcript)
18
Machines
  • Data mining paradigms are characterized by
  • A concept of operation (CONOP: component structure, I/O, training algorithm, operation)
  • An architecture (component type, arrangement, semantics)
  • A set of parameters (weights/coefficients/vigilance parameters)
  • >>> It is assumed here that parameters are real numbers. <<<
  • A machine is an instantiation of a data mining paradigm.
  • Examples of parameter sets for various paradigms:
  • Neural Networks: interconnect weights
  • Belief Networks: conditional probability tables
  • Kernel-based classifiers (SVM, RBF): regression coefficients
  • Metric classifiers (K-means): cluster centroids

19
A Spiral Methodology for the Data Mining Process
20
The DM Discovery Phase: Descriptive Modeling
  • OLAP
  • Visualization
  • Unsupervised learning
  • Link Analysis/Collaborative Filtering
  • Rule Induction

21
The DM Exploitation Phase: Predictive Modeling
  • Paradigm selection
  • Test design
  • Formulation of meta-schemes
  • Model construction
  • Model evaluation
  • Model deployment
  • Model maintenance

22
A de facto standard DM Methodology
  • CRISP-DM (cross-industry standard process for
    data mining)
  • 1.) Business Understanding
  • 2.) Data Understanding
  • 3.) Data Preparation
  • 4.) Modeling
  • 5.) Evaluation
  • 6.) Deployment

23
Data Mining Paradigms: What does your solution look like?
  • Conventional Decision Models - statistical inference, logistic regression, score cards
  • Heuristic Models - human expert, knowledge-based expert systems, fuzzy logic, decision trees, belief nets
  • Regression Models - neural networks (all sorts), radial basis functions, adaptive logic networks, decision trees, SVM

24
Real-World DM Business Challenges
  • Complex and conflicting goals
  • Defining success
  • Getting buy-in
  • Enterprise data is distributed
  • Limited automation
  • Unrealistic expectations

25
Real-World DM Technical Challenges
  • big data consume space and time
  • efficiency vs. comprehensibility
  • combinatorial explosion
  • diluted information
  • difficult to develop intuition
  • algorithm roulette

26
Data Mining Problems: What does your domain look like?
  • How well is the problem understood?
  • How "big" is the problem?
  • What kind of data do we have?
  • What question are we answering?
  • How deeply buried in the data is the answer?
  • How must the answer be presented to the user?

27
1. Business Understanding
  • How well is the problem understood?

28
How well is the problem understood?
  • Domain intuition: low/medium/high
  • Experts available?
  • Good documentation?
  • DM team's prior experience?
  • Prior art?
  • What is the enterprise definition of success?
  • What is the target environment?
  • How skillful are the users?
  • Where are the pitchforks?

29
2. Data Understanding / 3. Preparing the Data
  • How "big" is the problem?
  • What kind of data do we have?

30
DM Aspects of Data Preparation
  • Data Selection
  • Data Cleansing
  • Data Representation
  • Feature Extraction and Transformation
  • Feature Enhancement
  • Data Division
  • Configuration Management

31
How "big" is the problem?
  • Number of exemplars (rows)
  • Number of features (columns)
  • Number of classes (ground truth)
  • Cost/schedule/talent (dollars, days, dudes)
  • Tools (own/make/buy, familiarity, scope)

32
What kind of data do we have?
  • Feature type: nominal/numeric/complex
  • Feature mix: homo-/heterogeneous by type
  • Feature tempo
  • Fresh/stale
  • Periodic/sporadic
  • Synchronous/asynchronous
  • Feature data quality
  • Low/high SNR
  • Few/many gaps
  • Easy/hard to access
  • Objective/subjective
  • Feature information quality
  • Salience, correlation, localization, conditioning
  • Comprehensive? Representative?

33
How much data do I need?
  • Many heuristics
  • Monte's 6MN rule, and other similar heuristics
  • Support vectors
  • Segmentation requirements
  • Comprehensive
  • Representative
  • Consider population imbalance

34
Feature Saliency Tests
  • Correlation/Independence
  • Visualization to determine saliency
  • Autoclustering to test for homogeneity
  • KL-Principal Component Analysis
  • Statistical Normalization (e.g., ZSCORE)
  • Outliers, Gaps
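A minimal sketch of the statistical normalization test listed above, using z-scores from Python's standard library (the sample values are hypothetical; extreme z-scores flag candidate outliers):

  import statistics

  def zscore(values):
      """Normalize a list of numeric feature values to zero mean and unit variance."""
      mu = statistics.mean(values)
      sigma = statistics.stdev(values)            # sample standard deviation
      return [(v - mu) / sigma for v in values]

  ages = [21, 25, 22, 24, 63]                     # 63 stands out as a possible outlier
  print([round(z, 2) for z in zscore(ages)])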

35
Making Feature Sets for Data Mining
  • Converting Nominal Data to Numeric: Numeric Coding
  • Converting Numeric Data to Nominal: Symbolic Coding
  • Creating Ground-Truth
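A minimal sketch of both coding directions (the categories and bin edges are hypothetical):

  # Nominal -> numeric: one-of-N (one-hot) coding of a symbolic feature.
  COLORS = ["red", "green", "blue"]
  def code_color(color):
      return [1 if color == c else 0 for c in COLORS]

  # Numeric -> nominal: symbolic coding by quantizing into named bins.
  def code_age(age):
      if age < 18: return "minor"
      if age < 65: return "adult"
      return "senior"

  print(code_color("green"))   # [0, 1, 0]
  print(code_age(70))          # senior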

36
Information can be Irretrievably Distributed
(e.g., the parity-N problem)
  • 0010100110 → 1
  • The best feature set is not necessarily the set
    of best features.
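A minimal sketch of the parity-N target (the labeling convention is assumed so that the slide's example string maps to 1). The information is irretrievably distributed: no single bit, taken alone, predicts the label, yet all bits together determine it exactly.

  def parity_label(bits):
      """1 if the bit string contains an even number of 1s, else 0 (convention assumed)."""
      return 1 if bits.count("1") % 2 == 0 else 0

  print(parity_label("0010100110"))   # 1 -- yet flipping ANY single bit flips the label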

37
An example of a Feature Metric
  • Salience: the geometric mean of class precisions
  • an objective measure of the ability of a feature
    to distinguish classes
  • takes class proportion into account
  • specific to a particular classifier and problem
  • does not measure independence
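A minimal sketch of this metric, assuming the per-class precisions of a particular classifier on a particular problem have already been computed:

  import math

  def salience(per_class_precision):
      """Geometric mean of the class precisions (0 if any class is never predicted correctly)."""
      if any(p == 0 for p in per_class_precision):
          return 0.0
      logs = [math.log(p) for p in per_class_precision]
      return math.exp(sum(logs) / len(logs))

  print(round(salience([0.90, 0.60, 0.75]), 3))   # ~0.74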

38
Nominal to Numeric Coding...one step at a time!
(Worked example shown as tables on the slide: Original Data, Step 1, Step 2)
39
Numeric to Nominal Quantization
40
Clusters Usually Mean Something
41
How many objects are shown here? One, seen from various perspectives! This illustrates the danger of using ONE METHOD/TOOL/VISUALIZATION!
42
Autoclustering
  • Automatically find spatial patterns in complex
    data
  • find patterns in data
  • measure the complexity of the data
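A minimal autoclustering sketch (assumes scikit-learn and NumPy are available; the two-cluster synthetic data are hypothetical):

  import numpy as np
  from sklearn.cluster import KMeans

  # Synthetic 2-D feature vectors forming two spatial groups.
  rng = np.random.default_rng(0)
  data = np.vstack([rng.normal(0.0, 0.3, (50, 2)),
                    rng.normal(3.0, 0.3, (50, 2))])

  km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
  print(km.cluster_centers_)   # roughly [0, 0] and [3, 3]: the latent spatial pattern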

43
Differential Analysis
  • Discover the Difference Drivers Between Groups
  • Which combination of features accounts for the
    observed differences between groups?
  • Focus research

44
Sensitivity Analysis
  • Measure the Influence of Individual Features on
    Outcomes
  • Rank order features by salience and independence
  • Estimate problem difficulty
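One common way to measure a feature's influence is to scramble it and watch how much accuracy drops. A minimal permutation-style sketch (assumes a trained scikit-learn-style model exposing score(X, y); this particular technique is an illustration, not necessarily the author's method):

  import numpy as np

  def feature_influence(model, X, y, n_repeats=10, seed=0):
      """Rank features by how much scrambling each one degrades the model's accuracy."""
      rng = np.random.default_rng(seed)
      base = model.score(X, y)
      drops = []
      for j in range(X.shape[1]):
          scores = []
          for _ in range(n_repeats):
              Xp = X.copy()
              rng.shuffle(Xp[:, j])             # destroy feature j's information
              scores.append(model.score(Xp, y))
          drops.append(base - np.mean(scores))
      return np.argsort(drops)[::-1]            # most influential feature first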

45
Rule Induction
  • Automatically find semantic patterns in complex
    data
  • discover rules directly from data
  • organize raw data into actionable knowledge
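A minimal rule-induction sketch using a decision tree to discover rules (data splits) directly from data (assumes scikit-learn; the toy records and feature names are hypothetical):

  from sklearn.tree import DecisionTreeClassifier, export_text

  X = [[20, 0], [45, 1], [30, 1], [60, 0], [25, 0], [50, 1]]   # [age, smoker]
  y = [0, 1, 0, 1, 0, 1]                                       # outcome labels

  tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
  print(export_text(tree, feature_names=["age", "smoker"]))    # human-readable rules induced from the data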

46
A Rule Induction Example (using data splits)
47
Rule Induction Example (Data Splits)
48
(No Transcript)
49
(No Transcript)
50
(No Transcript)
51
(No Transcript)
52
(No Transcript)
53
4. Modeling
  • What question are we answering?
  • How deeply buried in the data is the answer?
  • How must the answer be presented to the user?

54
(No Transcript)
55
What question are we answering?
  • Ground truth type
  • Nominal
  • Numeric
  • Complex (e.g., interval estimate, plan, concept)
  • Ground truth data quality
  • Low/high SNR
  • Few/many gaps
  • Easy/hard to access
  • Objective/subjective
  • Ground truth predictability
  • Correlation with features
  • Population balance
  • Class collisions

56
How deeply buried in the data is the answer?
  • Solvable by a 1-layer Multi-Layer Perceptron (easy)
  • Linearly separable: any two classes can be separated by a hyperplane
  • Solvable by a 2-layer Multi-Layer Perceptron (moderate)
  • Convex hulls of classes overlap, but classes do not
  • Solvable by a 3-layer Multi-Layer Perceptron (hard)
  • Classes overlap but do not collide
  • Intractable
  • Data contain class collisions
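A minimal sketch of probing the "easy" case: if the classic perceptron training rule reaches zero errors, the two classes are linearly separable by a hyperplane (assumes NumPy, numeric features, and labels coded as -1/+1):

  import numpy as np

  def linearly_separable(X, y, epochs=1000, lr=0.1):
      """Crude test: run the perceptron rule; an error-free pass implies a separating hyperplane."""
      Xb = np.hstack([X, np.ones((len(X), 1))])      # append a bias column
      w = np.zeros(Xb.shape[1])
      for _ in range(epochs):
          mistakes = 0
          for xi, yi in zip(Xb, y):
              if yi * np.dot(w, xi) <= 0:            # misclassified (or on the boundary)
                  w += lr * yi * xi
                  mistakes += 1
          if mistakes == 0:
              return True
      return False                                    # inconclusive within the epoch budget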

57
How must the answer be presented to the user?
  • Forensics
  • GUI, confidence factors, intervals, justification
  • Integration
  • Web-based, Web-enabled, DLL/SL, fully integrated
  • Accuracy
  • Percent correct, confusion matrix, lift chart
  • Performance
  • Throughput, ease of use, accuracy, reliability

58
(No Transcript)
59
Textbook Neural Network
60
(No Transcript)
61
(No Transcript)
62
(No Transcript)
63
(No Transcript)
64
(No Transcript)
65
(No Transcript)
66
(No Transcript)
67
Knowledge Acquisition
  • What the Expert says
  • KE: ...and, primates. What evidence makes you CERTAIN an animal is a primate?
  • EX: Yeah, well, like... if it's a land animal that'll eat anything... but it bears live young and walks upright,...
  • KE: Any obvious physical characteristics?
  • EX: Uh... yes... and no feathers, of course, or wings, or any of that... Well, then... then, it's gotta be a primate... yeah.
  • KE: So, ANY animal which is a land-dwelling, omnivorous, skin-covered, unwinged featherless biped which bears live young is NECESSARILY a primate?
  • EX: Yep.
  • KE: Could such an animal be, say, a fish?
  • EX: No... it couldn't be anything but a primate.

68
What the KE hears
  • IF
  • (f1, f2, f3, f4, f5) = (land, omnivore, no feathers, wingless biped, born alive)
  • THEN
  • PRIMATE and (not fish, not domestic, not bug, not
    germ, not bird)
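A minimal sketch of how a KBES might encode the rule the knowledge engineer heard (the feature names and dict layout are illustrative, not the author's):

  def classify(f):
      """Fire the captured primate rule; f is a dict of observed attribute values."""
      if (f.get("habitat") == "land" and f.get("diet") == "omnivore"
              and not f.get("feathers") and not f.get("wings")
              and f.get("gait") == "biped" and f.get("bears_live_young")):
          return "PRIMATE"      # and therefore not fish, domestic, bug, germ, or bird
      return "unknown"

  print(classify({"habitat": "land", "diet": "omnivore", "feathers": False,
                  "wings": False, "gait": "biped", "bears_live_young": True}))   # PRIMATE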

69
Evaluation
  • How must the answer be presented to the user?

70
Model Evaluation
  • Accuracy
  • Classification accuracy, geometric accuracy
  • precision/recall
  • RMS
  • Lift curve
  • Confusion matrices
  • ROI
  • Speed, space, utility, other
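A minimal sketch of two of these measures, a binary confusion matrix and precision/recall, computed from scratch (labels assumed coded 0/1):

  def confusion(actual, predicted):
      """Return (tp, fp, fn, tn) counts for binary labels 0/1."""
      tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
      fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
      fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
      tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
      return tp, fp, fn, tn

  actual    = [1, 0, 1, 1, 0, 0, 1]
  predicted = [1, 0, 0, 1, 0, 1, 1]
  tp, fp, fn, tn = confusion(actual, predicted)
  print("precision:", tp / (tp + fp), "recall:", tp / (tp + fn))   # 0.75 and 0.75

Here fp and fn are the Type I and Type II error counts described on the next slide.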

71
Classification Errors
  • Type I - accepting an item as a member of a class when it actually is not a member: a false positive.
  • Type II - rejecting an item as a member of a class when it actually is a member: a false negative.

72
Model Maintenance
  • Retraining, stationarity
  • Generalization (e.g. heteroscedasticity)
  • Changing the feature set (add/subtract)
  • Conventional maintenance issues

73
What do we give the user besides an application?
  • Documentation
  • Support
  • Model retraining
  • New model generation

74
Using a Paradigm Taxonomy to Select a DM Algorithm
  • Place paradigms into a taxonomy by specifying
    their attributes. This taxonomy can be used for
    algorithm selection.
  • First, an example taxonomy.

75
KBES (Knowledge-Based Expert System)
required intuition: high
vector count supported: high
feature count supported: medium
class count supported: medium
cost to develop: high
schedule to develop: high
talent to develop: medium, high
tools to develop: can be expensive to buy/make
feature types supported: nominal/numeric/complex
feature mix supported: homogeneous, heterogeneous
feature data quality needed: need not fill "gaps"
ground truth types supported: nominal, complex
relative representational power: low
relative performance: fast, intuitive, robust
relative weaknesses: ad hoc; relatively simple class boundaries
relative strengths: intuitive; easy to provide conclusion justification
76
MLP (Multi-Layer Perceptron)
required intuition: low
vector count supported: high
feature count supported: medium
class count supported: medium
cost to develop: low
schedule to develop: medium
talent to develop: medium
tools to develop: easy to obtain inexpensively
feature types supported: numeric
feature mix supported: homogeneous
feature data quality needed: must fill "gaps"
ground truth types supported: nominal, numeric
relative representational power: high
relative performance: moderately fast
relative weaknesses: inscrutable; uncontrolled regression
relative strengths: easy to build
77
RBF (Radial Basis Function)
required intuition: low
vector count supported: high
feature count supported: medium
class count supported: high
cost to develop: low
schedule to develop: medium
talent to develop: medium
tools to develop: easy to obtain inexpensively
feature types supported: numeric
feature mix supported: homogeneous
feature data quality needed: need not fill "gaps"
ground truth types supported: nominal, numeric
relative representational power: high
relative performance: moderately fast
relative weaknesses: inscrutable; models tend to be large
relative strengths: uncontrolled regression can be mitigated
78
SVM (Support Vector Machines)
required intuition: low
vector count supported: high
feature count supported: high
class count supported: two
cost to develop: medium
schedule to develop: medium
talent to develop: medium
tools to develop: easy to obtain inexpensively
feature types supported: numeric
feature mix supported: homogeneous
feature data quality needed: must fill "gaps"
ground truth types supported: nominal, numeric
relative representational power: high
relative performance: moderately fast
relative weaknesses: inscrutable; can be hard to train
relative strengths: minimal need to enhance features
79
Decision Trees (e.g., CART, BBNs)
required intuition: low
vector count supported: high
feature count supported: medium
class count supported: high
cost to develop: low
schedule to develop: medium
talent to develop: medium
tools to develop: easy to obtain inexpensively
feature types supported: nominal, numeric
feature mix supported: homogeneous, heterogeneous
feature data quality needed: need not fill "gaps"
ground truth types supported: nominal, numeric
relative representational power: high
relative performance: moderately fast
relative weaknesses: many "low support" nodes or rules
relative strengths: can provide insight into the domain
80
The taxonomy can be used to match available paradigms with the characteristics of the data mining problem to be addressed; the selection rules on the following slides (and the code sketch after them) illustrate this.
81
IF
   the ground truth is discrete
   there aren't too many classes
   the class boundaries are simple
   the number of features is medium
   the data are heterogeneous
   no comprehensive, representative data set with GT
   the population is unbalanced by class
   the domain is well-understood by available experts
   conclusion justification is needed
THEN KBES
82
ELSE IF
   the ground truth is numeric
   there is a medium number of classes
   the class boundaries are complex
   the number of features is medium
   the data are numeric
   comprehensive, representative data set tagged with GT
   the population is relatively balanced by class
   the domain is not well-understood by available experts
   conclusion justification is not needed
THEN MLP
83
ELSE IF
   the ground truth is numeric or nominal
   there is a large number of classes
   the class boundaries are very complex
   the number of features is medium
   the data are numeric
   representative data set tagged with GT
   the population is unbalanced by class
   the domain is not well-understood by available experts
   conclusion justification is not needed
THEN RBF
84
ELSE IF
   the ground truth is numeric or nominal
   the number of classes is two
   the class boundaries are very complex
   the number of features is very large
   the data are numeric
   comprehensive, representative data set tagged with GT
   the population is unbalanced by class
   the domain is not well-understood by available experts
   conclusion justification is not needed
THEN SVM
85
ELSE IF
   the ground truth is numeric or nominal
   there is a medium number of classes
   the class boundaries are very complex
   the number of features is medium
   the data are numeric, nominal, or complex
   representative data set tagged with GT
   the population is unbalanced by class
   the domain is not well-understood by available experts
   conclusion justification is needed
THEN Decision Tree (CART, BBN, etc.)
END IF
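The same selection logic, rendered as a deliberately simplified Python sketch (the attribute names and the reduced set of tests are illustrative; the full conditions are on the preceding slides):

  def select_paradigm(p):
      """Suggest a paradigm from a dict of problem characteristics (simplified)."""
      if p["gt"] == "discrete" and p["expertise"] == "high" and p["justification"]:
          return "KBES"
      if p["gt"] == "numeric" and p["boundaries"] == "complex" and p["balanced"]:
          return "MLP"
      if p["boundaries"] == "very complex" and p["classes"] == "many":
          return "RBF"
      if p["classes"] == "two" and p["features"] == "very large":
          return "SVM"
      return "Decision Tree (CART, BBN, etc.)"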
86
Common Reasons Data Mining Projects Fail
87
Mistakes can occur in each major element of data
mining practice!
  • 1. Specification of Enterprise Objectives
  • Defining success
  • 2. Creation of the DM Environment
  • Understanding and Preparing the Data
  • 3. Data Mining Management
  • 4a,b. Descriptive Modeling and Predictive
    Modeling
  • Detecting and Characterizing Patterns
  • Building Models
  • Model Evaluation
  • Model Deployment
  • Model Maintenance

88
1. Specification of Enterprise Objectives
  • Define success
  • Knowledge acquisition interviews (who, what, how)
  • Objective measures of performance (enterprise
    specific)
  • Assessment of enterprise process and data
    environment
  • Specification of data mining objectives

89
Specification Mistakes
  • DM projects require careful management of user
    expectations. Choosing the wrong person as
    customer interface can guarantee user
    disappointment.
  • (GIGOO: Garbage In, GOLD Out!)
  • Since the default assessment of R&D-type efforts is failure, not defining success unambiguously will guarantee failure.

90
2. Creation of the DM Environment
  • Data Warehouse/Data Mart /Database
  • Meta data and schemas
  • Data dependencies
  • Access paths and mechanisms

91
Environmental Mistakes
  • Big data require bigger storage. DM efforts typically work against multiple copies of the data; plan on 2 or 3x.
  • Unwillingness to invest in tools forces data
    miners to consume resources building inferior
    versions of what could have been purchased more
    cheaply.
  • Get labs and network connections set up quickly.

92
Understanding the Data
  • Enterprise data survey
  • Data as a process artifact
  • Temporal Considerations
  • Data Characterization
  • Metadata
  • Collection paths
  • Data Metrics and Quality
  • currency, completeness, correctness, correlation

93
A List of Common Data Problems
  • Conformation (e.g., a dozen ways to say lat/lon)
  • Accessibility (distributed, sensitive)
  • Ground Truth (missing, incorrect)
  • Outliers (detect/process)
  • Gaps (imputation scheme)
  • Time (coverage, periodicity, trends, Nyquist)
  • Consistency (intra/inter record)
  • Class collisions (how to adjudicate)
  • Class population imbalance (balancing)
  • Coding/quantization

94
Data Understanding Mistakes
  • Assuming that no understanding of the domain is
    needed for a successful DM effort
  • Temporal infeasibility: assuming every type of data you find in the warehouse will actually be there when your fielded system needs it.
  • Ignoring the data conformation problem

95
Data Preparation Mistakes
  • Improper handling of missing data, outliers
  • Improper conditioning of data
  • "Trojan-horsing" ground truth into the feature set
  • Having no plan for getting operational access to
    data

96
3. Data Mining Management
  • Data mining skill mix (who are the DM
    practitioners?)
  • Data mining project planning (RAD vs. waterfall)
  • Data mining project management
  • Sample DM project cost/schedule
  • Don't forget Configuration Management!

97
DM Management Mistakes
  • Appointing a domain expert as the technical lead on a DM project virtually guarantees that no new ground will be covered.
  • Inadequate schedule and/or budget poison the
    psychological atmosphere necessary for discovery.
  • Failure to parallelize work
  • Allowing planless tinkering
  • Letting technical people snow you
  • Failure to conduct process audits

98
Configuration Management
  • Nomenclature and naming conventions
  • Documenting the workflow for reproducibility
  • Modeling Process Automation

99
Configuration Management Mistakes
  • Not having a configuration management plan
    (files, directories, nomenclature, audit trail)
    virtually guarantees that any success you have
    will be unreproducible.
  • Allowing each data miner to establish their own
    documentation and auditing procedures guarantees
    that no one will understand what anyone else has
    done.
  • Failure to automate configuration management
    (e.g., putting annotated experiment scripts in a
    log) guarantees that your configuration
    management plan will not work.

100
4a. Descriptive Modeling
  • OLAP (on-line analytical processing)
  • Visualization
  • Unsupervised learning
  • Link/Market Basket Analysis
  • Collaborative Filtering
  • Rule Induction Techniques
  • Logistic Regression

101
4b. Predictive Modeling
  • Paradigms
  • Test Design
  • Meta-Schemes
  • Model Construction
  • Model Evaluation
  • Model Deployment
  • Model Maintenance

102
Paradigms
  • Know what they are
  • Know when to use which
  • Know how to instantiate them
  • Know how to validate them
  • Know how to maintain them

103
Model Construction
  • Architecture (monolithic, hybrid)
  • Formulation of Objective Function
  • Training (e.g., NN)
  • Construction (e.g., KBES)
  • Meta Schemes
  • Bagging
  • Boosting
  • Post-process model calibration
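A minimal sketch of one meta-scheme, bagging (assumes NumPy arrays, integer class labels, and scikit-learn-style base models with fit/predict; this is a generic illustration, not the author's specific scheme):

  import numpy as np

  def bagged_predict(base_models, X_train, y_train, X_test, seed=0):
      """Train each base model on a bootstrap resample, then majority-vote the predictions."""
      rng = np.random.default_rng(seed)
      votes = []
      for model in base_models:
          idx = rng.integers(0, len(X_train), len(X_train))   # sample with replacement
          model.fit(X_train[idx], y_train[idx])
          votes.append(model.predict(X_test))
      votes = np.asarray(votes, dtype=int)
      return np.array([np.bincount(col).argmax() for col in votes.T])   # per-vector majority vote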

104
Modeling Mistakes
  • The "Silver Bullet" Syndrome: relying entirely on a single tool/method
  • Expecting your tools to think for you
  • Overreliance on visualization
  • Using tools that you don't understand
  • Not knowing when to quit ("maybe this is just dirt")
  • Quitting too soon ("I haven't dug deep enough")
  • Picking the wrong modeling paradigm
  • Ignoring population imbalance
  • Overtraining
  • Ignoring feature correlation

105
5. Model Evaluation
  • Blind Testing
  • N-fold Cross-Validation
  • Generalization and Overtraining
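A minimal sketch of N-fold cross-validation (assumes NumPy arrays and a scikit-learn-style estimator with fit and score):

  import numpy as np

  def n_fold_cv(model, X, y, n_folds=5, seed=0):
      """Average blind-test score over N folds; each vector is held out exactly once."""
      rng = np.random.default_rng(seed)
      folds = np.array_split(rng.permutation(len(X)), n_folds)
      scores = []
      for i, test_idx in enumerate(folds):
          train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
          model.fit(X[train_idx], y[train_idx])
          scores.append(model.score(X[test_idx], y[test_idx]))
      return float(np.mean(scores))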

106
Model Evaluation Mistakes
  • Not validating the model
  • Validating the model on the training data
  • Not escrowing a holdback set

107
  6. Model Deployment
  • ASP (applications service provider)
  • API (application program interface)
  • Other
  • plug-ins
  • linked objects
  • file interface, etc.

108
Model Deployment Mistakes
  • Not considering the fielded architecture
  • No user training
  • Not having any operational performance
    requirements (except accuracy)

109
7. Model Maintenance
  • Retraining
  • Poor generalization
  • Heteroscedasticity
  • Non-stationarity
  • Overtraining
  • Changing the problem architecture
  • Adding/subtracting features
  • Modifying ground truth
  • Other

110
Model Maintenance Mistakes
  • Not having a mechanism, method, and criteria for
    tracking performance of the fielded model
  • Not providing a model retraining capability
  • No documentation, no support

111
Published by Digital Press, 2001. ISBN 1-55558-231-1
112
(No Transcript)