1
These are general notional tutorial slides on
data mining theory and practice from which
content may be freely drawn.
  • Monte F. Hancock, Jr.
  • Chief Scientist
  • Celestech, Inc.

2
Data Mining is the detection, characterization,
and exploitation of actionable patterns in data.
3
Data Mining (DM)
  • Data Mining (DM) is the principled detection,
    characterization, and exploitation of actionable
    patterns in data.
  • It is performed by applying modern mathematical
    techniques to collected data in accordance with
    the scientific method.
  • DM uses a combination of empirical and
    theoretical principles to Connect Structure to
    Meaning by
  • Selecting and conditioning relevant data
  • Identifying, characterizing, and classifying
    latent patterns
  • Presenting useful representations and
    interpretations to users
  • DM attempts to answer these questions
  • What patterns are in the information?
  • What are the characteristics of these patterns?
  • Can meaning be ascribed to these patterns
    and/or their changes?
  • Can these patterns be presented to users in a way
    that will facilitate their assessment,
    understanding, and exploitation?
  • Can a machine learn these patterns and their
    relevant interpretations?

4
DM for Decision Support
  • Decision Support is all about
  • enabling users to group information in familiar
    ways
  • controlling complexity by layering results (e.g.,
    drill-down)
  • supporting users' changing priorities
  • allowing intuition to be triggered ("I've seen this before!")
  • preserving and automating perishable institutional knowledge
  • providing objective, repeatable metrics (e.g., confidence factors)
  • fusing and simplifying results
  • automating alerts on important results ("It's happening again!")
  • detecting emerging behaviors before they
    consummate (Look!)
  • delivering value (timely-relevant-accurate
    results)
  • helping users make the best choices.

5
DM Provides Intelligent Analytic Functions
  • Automating pattern detection to characterize
    complex, distributed signatures that are worth
    human attention and recognize those that are
    not.
  • Associating events that go together but are
    difficult for humans to correlate.
  • Characterizing interesting processes, not just facts or simple events
  • Detecting actionable anomalies and explaining what makes them different AND interesting.
  • Describing contexts from multiple perspectives with numbers, text, and graphics

6
DM Answers Questions Users are Asking
  • Fusion Level 1: Who/What is Where/When in my space?
  • Organize and present facts in domain context
  • Fusion Level 2: What does it mean?
  • Has this been seen before? What will happen next?
  • Fusion Level 3: Do I care?
  • Enterprise relevance? What action should be taken?
  • Fusion Level 4: What can I do better next time?
  • Adaptation by pattern updates and retraining
  • How certain am I?
  • Quantitative assessment of evidentiary pedigree

7
Useful Data Applications
  • Accurate identification and classification - add value to raw data by tagging and annotation (e.g., fraud detection)
  • Anomaly/normalcy and fusion - characterize, quantify, and assess normalcy of patterns and trends (e.g., network intrusion detection)
  • Emerging patterns and evidence evaluation - capturing institutional knowledge of how events arise and alerting when they emerge
  • Behavior association - detection of actions that are distributed in time/space but synchronized by a common objective ("connecting the dots")
  • Signature detection and association - detection/characterization of multivariate signals, symbols, and emissions (e.g., voice recognition)
  • Concept tagging - reasoning about abstract relationships to tag and annotate media of all types (e.g., automated web bots)
  • Software agents assisting analysts - small-footprint "fire-and-forget" apps that facilitate search, collaboration, etc.

8
Some Good Data Mining Analytic
Applications
  • Help the user focus via unobtrusive automation
  • Off-load burdensome labor (perform intelligent
    searches, smart winnowing)
  • Post smart triggers/tripwires to data stream
    (e.g., anomaly detection)
  • Help with mission triage (Sort my in-basket!)
  • Automate aspects of classification and detection
  • Determine which sets of data hold the most
    information for a task
  • Support construction of ad hoc on-the-fly
    classifiers
  • Provide automated constructs for merging decision
    engines (multi-level fusion)
  • Detect and characterize domain drift (the
    rules of the game are changing)
  • Provide functionality to make best estimate of
    missing data
  • Extract/characterize/employ knowledge
  • Rule induction from data, develop signatures
    from data
  • Implement reasoning for decision support
  • High-dimensional visualization
  • Embed decision explanation capability into
    analytic applications
  • Capture/automate/institutionalize best practice
  • Make proven analytic processes available to all
  • Capture rare, perishable human knowledge and put
    it everywhere
  • Generate signature-ready prose reports

9
Things that make hard problems VERY hard
  • Events of interest occur relatively infrequently
    in very large datasets (population imbalance)
  • Information is distributed in a complex way
    across many features (the feature selection
    problem)
  • Collection is hard to task, data are difficult to
    prepare for analysis, and are never perfect
    (noise in the data, data gaps, coverage gaps)
  • Target patterns are ambiguous/unknown; "squelch" settings are brittle (e.g., hard to balance detection vs. false-alarm rates)
  • Target patterns change/morph over time and across operational modes (domain drift; processing methods become stale)

10
Some Key Principles of Information Driven Data
Mining
  1. Right People, Methods, Tools (in that order)
  2. Make no prior assumptions about the problem
    (agnostic)
  3. Begin with general techniques that let the data
    determine the direction of the analysis (Funnel
    Method)
  4. Don't jump to conclusions; perform process audits as needed
  5. Don't be a "one-widget wonder"; integrate multiple paradigms so the strengths of one compensate for the weaknesses of another
  6. Break the problem into the right pieces (Divide and Conquer)
  7. Work the data, not the tools, but automate when possible
  8. Be systematic, consistent, and thorough; don't lose the forest for the trees.
  9. Document the work so that it is reproducible
  10. Collaborate to avoid surprises: team members, experts, customer
  11. Focus on the Goal: maximum value to the user within cost and schedule

11
Select Appropriate Machine Reasoners
  • 1.) Classifiers
  • Classifiers ingest a list of attributes, and
    determine into which of finitely many categories
    the entity exhibiting these attributes falls.
    Automatic object recognition and next-event
    prediction are examples of this type of
    reasoning.
  • 2.) Estimators
  • Estimators ingest a list of attributes, and
    assign some numeric value to the entity
    exhibiting these attributes. The estimation of a
    probability or a "risk score" are examples of
    this type of reasoning.
  • 3.) Semantic Mappers
  • Semantic mappers ingest text (structured,
    unstructured, or both), and generate a data
    structure that gives the "meaning" of the text.
    Automatic gisting of documents is an example of this type of reasoning. Semantic mapping generally requires some kind of domain model.
  • 4.) Planners
  • Planners ingest a scenario description, and
    formulate an efficient sequence of feasible
    actions that will move the domain to the
    specified goal state.
  • 5.) Associators
  • Associators sample the entire corpus of
    domain data, and identify relationships among
    entities. Automatic clustering of data to
    identify coherent subpopulations is a simple
    example. A more sophisticated example is the
    forensic analysis of phone, flight, and financial
    records to infer the structure of terrorist
    networks.

12
Overcoming Processing Challenges through
Intelligent Automation of Data Conditioning,
Feature Selection, and Source Conformation
  • Data Quality
  • Cleanliness, Consistency
  • Comprehensiveness
  • Completeness
  • Correctness
  • Information Quality
  • Representative (ground truth)
  • Timeliness
  • Salience
  • Independence
  • Attributes of Enterprise Problems
  • New trends
  • New behavior/event schemes
  • Non-stationarity
  • Population imbalance
  • Inability to act on findings

13
Embedded Knowledge
  • Principled, domain-savvy synthesis of
    circumstantial evidence
  • Copes well with ambiguous, incomplete, or
    incorrect input
  • Enables justification of results in terms domain
    experts use
  • Facilitates good pedagogical helps
  • Solves the problem the way the human expert does, and so is comprehensible to most domain experts.
  • Degrades linearly in combinatorial domains
  • Can grow in power with experience
  • Preserves perishable expertise
  • Allows efficient incremental upgrade/adjustment/repurposing

14
Features
  • A feature is the value assumed by some attribute
    of an entity in the domain
  • (e.g., size, quality, age, color, etc.)
  • Features can be numbers, symbols, or complex data
    objects
  • Features are usually reduced to some simple form
    before modeling is performed.
  • >>> Features are usually single numeric values or contiguous strings. <<<

15
Feature Space
  • Once the features have been designated, a feature
    space can be defined for a domain by placing the
    features into an ordered array in a systematic
    way.
  • Each instance of an entity having the given features is then represented by a single point in n-dimensional Euclidean space: its feature vector.
  • This Euclidean space, or feature space for the
    domain, has dimension equal to the number of
    features.
  • Feature spaces can be one-dimensional,
    infinite-dimensional, or anywhere in between.
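A minimal sketch of the idea in Python (the attribute names and values below are hypothetical):

  # Sketch: pack an entity's attribute values into an ordered array -- one point in feature space.
  FEATURE_ORDER = ["size", "quality", "age"]          # hypothetical features, fixed ordering

  def to_feature_vector(entity):
      """Return the entity's feature vector as an ordered list of numbers."""
      return [float(entity[name]) for name in FEATURE_ORDER]

  print(to_feature_vector({"size": 12.0, "quality": 0.8, "age": 3}))   # [12.0, 0.8, 3.0]

The dimension of the resulting feature space equals the number of features in the ordering.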

16
How do classifiers work?
17
(No Transcript)
18
Machines
  • Data mining paradigms are characterized by
  • A concept of operation (CONOP: component structure, I/O, training algorithm, operation)
  • An architecture (component type, arrangement, semantics)
  • A set of parameters (weights/coefficients/vigilance parameters)
  • >>> It is assumed here that parameters are real numbers. <<<
  • A machine is an instantiation of a data mining paradigm.
  • Examples of parameter sets for various paradigms:
  • Neural Networks: interconnect weights
  • Belief Networks: conditional probability tables
  • Kernel-based classifiers (SVM, RBF): regression coefficients
  • Metric classifiers (K-means): cluster centroids

19
A Spiral Methodology for the Data Mining Process
20
The DM Discovery Phase: Descriptive Modeling
  • OLAP
  • Visualization
  • Unsupervised learning
  • Link Analysis/Collaborative Filtering
  • Rule Induction

21
The DM Exploitation Phase: Predictive Modeling
  • Paradigm selection
  • Test design
  • Formulation of meta-schemes
  • Model construction
  • Model evaluation
  • Model deployment
  • Model maintenance

22
A de facto standard DM Methodology
  • CRISP-DM (cross-industry standard process for
    data mining)
  • 1.) Business Understanding
  • 2.) Data Understanding
  • 3.) Data Preparation
  • 4.) Modeling
  • 5.) Evaluation
  • 6.) Deployment

23
Data Mining Paradigms: What does your solution look like?
  • Conventional Decision Models - statistical inference, logistic regression, score cards
  • Heuristic Models - human expert, knowledge-based expert systems, fuzzy logic, decision trees, belief nets
  • Regression Models - neural networks (all sorts), radial basis functions, adaptive logic networks, decision trees, SVM

24
Real-World DM Business Challenges
  • Complex and conflicting goals
  • Defining success
  • Getting buy-in
  • Enterprise data is distributed
  • Limited automation
  • Unrealistic expectations

25
Real-World DM Technical Challenges
  • big data consume space and time
  • efficiency vs. comprehensibility
  • combinatorial explosion
  • diluted information
  • difficult to develop intuition
  • algorithm roulette

26
Data Mining Problems: What does your domain look like?
  • How well is the problem understood?
  • How "big" is the problem?
  • What kind of data do we have?
  • What question are we answering?
  • How deeply buried in the data is the answer?
  • How must the answer be presented to the user?

27
1. Business Understanding
  • How well is the problem understood?

28
How well is the problem understood?
  • Domain intuition: low/medium/high
  • Experts available?
  • Good documentation?
  • DM team's prior experience?
  • Prior art?
  • What is the enterprise definition of success?
  • What is the target environment?
  • How skillful are the users?
  • Where are the pitchforks?

29
2. Data Understanding / 3. Preparing the Data
  • How "big" is the problem?
  • What kind of data do we have?

30
DM Aspects of Data Preparation
  • Data Selection
  • Data Cleansing
  • Data Representation
  • Feature Extraction and Transformation
  • Feature Enhancement
  • Data Division
  • Configuration Management

31
How "big" is the problem?
  • Number of exemplars (rows)
  • Number of features (columns)
  • Number of classes (ground truth)
  • Cost/schedule/talent (dollars, days, dudes)
  • Tools (own/make/buy, familiarity, scope)

32
What kind of data do we have?
  • Feature type: nominal/numeric/complex
  • Feature mix: homo-/heterogeneous by type
  • Feature tempo
  • Fresh/stale
  • Periodic/sporadic
  • Synchronous/asynchronous
  • Feature data quality
  • Low/high SNR
  • Few/many gaps
  • Easy/hard to access
  • Objective/subjective
  • Feature information quality
  • Salience, correlation, localization, conditioning
  • Comprehensive? Representative?

33
How much data do I need?
  • Many heuristics
  • Monte's 6MN rule, and other similar heuristics
  • Support vectors
  • Segmentation requirements
  • Comprehensive
  • Representative
  • Consider population imbalance

34
Feature Saliency Tests
  • Correlation/Independence
  • Visualization to determine saliency
  • Autoclustering to test for homogeneity
  • KL-Principal Component Analysis
  • Statistical Normalization (e.g., ZSCORE)
  • Outliers, Gaps
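A minimal sketch of the statistical normalization test listed above, using z-scores from Python's standard library (the sample values are hypothetical; extreme z-scores flag candidate outliers):

  import statistics

  def zscore(values):
      """Normalize a list of numeric feature values to zero mean and unit variance."""
      mu = statistics.mean(values)
      sigma = statistics.stdev(values)            # sample standard deviation
      return [(v - mu) / sigma for v in values]

  ages = [21, 25, 22, 24, 63]                     # 63 stands out as a possible outlier
  print([round(z, 2) for z in zscore(ages)])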

35
Making Feature Sets for Data Mining
  • Converting Nominal Data to Numeric: Numeric Coding
  • Converting Numeric Data to Nominal: Symbolic Coding
  • Creating Ground-Truth
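A minimal sketch of both coding directions (the categories and bin edges are hypothetical):

  # Nominal -> numeric: one-of-N (one-hot) coding of a symbolic feature.
  COLORS = ["red", "green", "blue"]
  def code_color(color):
      return [1 if color == c else 0 for c in COLORS]

  # Numeric -> nominal: symbolic coding by quantizing into named bins.
  def code_age(age):
      if age < 18: return "minor"
      if age < 65: return "adult"
      return "senior"

  print(code_color("green"))   # [0, 1, 0]
  print(code_age(70))          # senior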

36
Information can be Irretrievably Distributed
(e.g., the parity-N problem)
  • 0010100110 → 1
  • The best feature set is not necessarily the set
    of best features.
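A minimal sketch of the parity-N target (the labeling convention is assumed so that the slide's example string maps to 1). The information is irretrievably distributed: no single bit, taken alone, predicts the label, yet all bits together determine it exactly.

  def parity_label(bits):
      """1 if the bit string contains an even number of 1s, else 0 (convention assumed)."""
      return 1 if bits.count("1") % 2 == 0 else 0

  print(parity_label("0010100110"))   # 1 -- yet flipping ANY single bit flips the label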

37
An example of a Feature Metric
  • Salience: the geometric mean of class precisions
  • an objective measure of the ability of a feature
    to distinguish classes
  • takes class proportion into account
  • specific to a particular classifier and problem
  • does not measure independence
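A minimal sketch of this metric, assuming the per-class precisions of a particular classifier on a particular problem have already been computed:

  import math

  def salience(per_class_precision):
      """Geometric mean of the class precisions (0 if any class is never predicted correctly)."""
      if any(p == 0 for p in per_class_precision):
          return 0.0
      logs = [math.log(p) for p in per_class_precision]
      return math.exp(sum(logs) / len(logs))

  print(round(salience([0.90, 0.60, 0.75]), 3))   # ~0.74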

38
Nominal to Numeric Coding...one step at a time!
(Worked example shown as tables on the slide: Original Data, Step 1, Step 2)
39
Numeric to Nominal Quantization
40
Clusters Usually Mean Something
41
How many objects are shown here? One, seen from various perspectives! This illustrates the danger of using ONE METHOD/TOOL/VISUALIZATION!
42
Autoclustering
  • Automatically find spatial patterns in complex
    data
  • find patterns in data
  • measure the complexity of the data
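A minimal autoclustering sketch (assumes scikit-learn and NumPy are available; the two-cluster synthetic data are hypothetical):

  import numpy as np
  from sklearn.cluster import KMeans

  # Synthetic 2-D feature vectors forming two spatial groups.
  rng = np.random.default_rng(0)
  data = np.vstack([rng.normal(0.0, 0.3, (50, 2)),
                    rng.normal(3.0, 0.3, (50, 2))])

  km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
  print(km.cluster_centers_)   # roughly [0, 0] and [3, 3]: the latent spatial pattern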

43
Differential Analysis
  • Discover the Difference Drivers Between Groups
  • Which combination of features accounts for the
    observed differences between groups?
  • Focus research

44
Sensitivity Analysis
  • Measure the Influence of Individual Features on
    Outcomes
  • Rank order features by salience and independence
  • Estimate problem difficulty
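One common way to measure a feature's influence is to scramble it and watch how much accuracy drops. A minimal permutation-style sketch (assumes a trained scikit-learn-style model exposing score(X, y); this particular technique is an illustration, not necessarily the author's method):

  import numpy as np

  def feature_influence(model, X, y, n_repeats=10, seed=0):
      """Rank features by how much scrambling each one degrades the model's accuracy."""
      rng = np.random.default_rng(seed)
      base = model.score(X, y)
      drops = []
      for j in range(X.shape[1]):
          scores = []
          for _ in range(n_repeats):
              Xp = X.copy()
              rng.shuffle(Xp[:, j])             # destroy feature j's information
              scores.append(model.score(Xp, y))
          drops.append(base - np.mean(scores))
      return np.argsort(drops)[::-1]            # most influential feature first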

45
Rule Induction
  • Automatically find semantic patterns in complex
    data
  • discover rules directly from data
  • organize raw data into actionable knowledge
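A minimal rule-induction sketch using a decision tree to discover rules (data splits) directly from data (assumes scikit-learn; the toy records and feature names are hypothetical):

  from sklearn.tree import DecisionTreeClassifier, export_text

  X = [[20, 0], [45, 1], [30, 1], [60, 0], [25, 0], [50, 1]]   # [age, smoker]
  y = [0, 1, 0, 1, 0, 1]                                       # outcome labels

  tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
  print(export_text(tree, feature_names=["age", "smoker"]))    # human-readable rules induced from the data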

46
A Rule Induction Example (using data splits)
47
Rule Induction Example (Data Splits)
48
(No Transcript)
49
(No Transcript)
50
(No Transcript)
51
(No Transcript)
52
(No Transcript)
53
4. Modeling
  • What question are we answering?
  • How deeply buried in the data is the answer?
  • How must the answer be presented to the user?

54
(No Transcript)
55
What question are we answering?
  • Ground truth type
  • Nominal
  • Numeric
  • Complex (e.g., interval estimate, plan, concept)
  • Ground truth data quality
  • Low/high SNR
  • Few/many gaps
  • Easy/hard to access
  • Objective/subjective
  • Ground truth predictability
  • Correlation with features
  • Population balance
  • Class collisions

56
How deeply buried in the data is the answer?
  • Solvable by a 1-layer Multi-Layer Perceptron (easy)
  • Linearly separable: any two classes can be separated by a hyperplane
  • Solvable by a 2-layer Multi-Layer Perceptron (moderate)
  • Convex hulls of classes overlap, but classes do not
  • Solvable by a 3-layer Multi-Layer Perceptron (hard)
  • Classes overlap but do not collide
  • Intractable
  • Data contain class collisions
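A minimal sketch of probing the "easy" case: if the classic perceptron training rule reaches zero errors, the two classes are linearly separable by a hyperplane (assumes NumPy, numeric features, and labels coded as -1/+1):

  import numpy as np

  def linearly_separable(X, y, epochs=1000, lr=0.1):
      """Crude test: run the perceptron rule; an error-free pass implies a separating hyperplane."""
      Xb = np.hstack([X, np.ones((len(X), 1))])      # append a bias column
      w = np.zeros(Xb.shape[1])
      for _ in range(epochs):
          mistakes = 0
          for xi, yi in zip(Xb, y):
              if yi * np.dot(w, xi) <= 0:            # misclassified (or on the boundary)
                  w += lr * yi * xi
                  mistakes += 1
          if mistakes == 0:
              return True
      return False                                    # inconclusive within the epoch budget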

57
How must the answer be presented to the user?
  • Forensics
  • GUI, confidence factors, intervals, justification
  • Integration
  • Web-based, Web-enabled, DLL/SL, fully integrated
  • Accuracy
  • Percent correct, confusion matrix, lift chart
  • Performance
  • Throughput, ease of use, accuracy, reliability

58
(No Transcript)
59
Textbook Neural Network
60
(No Transcript)
61
(No Transcript)
62
(No Transcript)
63
(No Transcript)
64
(No Transcript)
65
(No Transcript)
66
(No Transcript)
67
Knowledge Acquisition
  • What the Expert says
  • KE: ...and, primates. What evidence makes you CERTAIN an animal is a primate?
  • EX: Yeah, well, like... if it's a land animal that'll eat anything... but it bears live young and walks upright,...
  • KE: Any obvious physical characteristics?
  • EX: Uh... yes... and no feathers, of course, or wings, or any of that... Well, then... then, it's gotta be a primate... yeah.
  • KE: So, ANY animal which is a land-dwelling, omnivorous, skin-covered, unwinged featherless biped which bears live young is NECESSARILY a primate?
  • EX: Yep.
  • KE: Could such an animal be, say, a fish?
  • EX: No... it couldn't be anything but a primate.

68
What the KE hears
  • IF
  • (f1, f2, f3, f4, f5) = (land, omnivore, no feathers, wingless biped, born alive)
  • THEN
  • PRIMATE and (not fish, not domestic, not bug, not
    germ, not bird)
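A minimal sketch of how a KBES might encode the rule the knowledge engineer heard (the feature names and dict layout are illustrative, not the author's):

  def classify(f):
      """Fire the captured primate rule; f is a dict of observed attribute values."""
      if (f.get("habitat") == "land" and f.get("diet") == "omnivore"
              and not f.get("feathers") and not f.get("wings")
              and f.get("gait") == "biped" and f.get("bears_live_young")):
          return "PRIMATE"      # and therefore not fish, domestic, bug, germ, or bird
      return "unknown"

  print(classify({"habitat": "land", "diet": "omnivore", "feathers": False,
                  "wings": False, "gait": "biped", "bears_live_young": True}))   # PRIMATE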

69
Evaluation
  • How must the answer be presented to the user?

70
Model Evaluation
  • Accuracy
  • Classification accuracy, geometric accuracy
  • precision/recall
  • RMS
  • Lift curve
  • Confusion matrices
  • ROI
  • Speed, space, utility, other
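A minimal sketch of two of these measures, a binary confusion matrix and precision/recall, computed from scratch (labels assumed coded 0/1):

  def confusion(actual, predicted):
      """Return (tp, fp, fn, tn) counts for binary labels 0/1."""
      tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
      fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
      fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
      tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
      return tp, fp, fn, tn

  actual    = [1, 0, 1, 1, 0, 0, 1]
  predicted = [1, 0, 0, 1, 0, 1, 1]
  tp, fp, fn, tn = confusion(actual, predicted)
  print("precision:", tp / (tp + fp), "recall:", tp / (tp + fn))   # 0.75 and 0.75

Here fp and fn are the Type I and Type II error counts described on the next slide.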

71
Classification Errors
  • Type I - accepting an item as a member of a class when it actually is not a member: a false positive.
  • Type II - rejecting an item as a member of a class when it actually is a member: a false negative.

72
Model Maintenance
  • Retraining, stationarity
  • Generalization (e.g. heteroscedasticity)
  • Changing the feature set (add/subtract)
  • Conventional maintenance issues

73
What do we give the user besides an application?
  • Documentation
  • Support
  • Model retraining
  • New model generation

74
Using a Paradigm Taxonomy to Select a DM Algorithm
  • Place paradigms into a taxonomy by specifying
    their attributes. This taxonomy can be used for
    algorithm selection.
  • First, an example taxonomy.

75
KBES (Knowledge-Based Expert System)
required intuition: high
vector count supported: high
feature count supported: medium
class count supported: medium
cost to develop: high
schedule to develop: high
talent to develop: medium, high
tools to develop: can be expensive to buy/make
feature types supported: nominal/numeric/complex
feature mix supported: homogeneous, heterogeneous
feature data quality needed: need not fill "gaps"
ground truth types supported: nominal, complex
relative representational power: low
relative performance: fast, intuitive, robust
relative weaknesses: ad hoc; relatively simple class boundaries
relative strengths: intuitive; easy to provide conclusion justification
76
MLP (Multi-Layer Perceptron)
required intuition: low
vector count supported: high
feature count supported: medium
class count supported: medium
cost to develop: low
schedule to develop: medium
talent to develop: medium
tools to develop: easy to obtain inexpensively
feature types supported: numeric
feature mix supported: homogeneous
feature data quality needed: must fill "gaps"
ground truth types supported: nominal, numeric
relative representational power: high
relative performance: moderately fast
relative weaknesses: inscrutable; uncontrolled regression
relative strengths: easy to build
77
RBF (Radial Basis Function)
required intuition: low
vector count supported: high
feature count supported: medium
class count supported: high
cost to develop: low
schedule to develop: medium
talent to develop: medium
tools to develop: easy to obtain inexpensively
feature types supported: numeric
feature mix supported: homogeneous
feature data quality needed: need not fill "gaps"
ground truth types supported: nominal, numeric
relative representational power: high
relative performance: moderately fast
relative weaknesses: inscrutable; models tend to be large
relative strengths: uncontrolled regression can be mitigated
78
SVM (Support Vector Machines)
required intuition: low
vector count supported: high
feature count supported: high
class count supported: two
cost to develop: medium
schedule to develop: medium
talent to develop: medium
tools to develop: easy to obtain inexpensively
feature types supported: numeric
feature mix supported: homogeneous
feature data quality needed: must fill "gaps"
ground truth types supported: nominal, numeric
relative representational power: high
relative performance: moderately fast
relative weaknesses: inscrutable; can be hard to train
relative strengths: minimal need to enhance features
79
Decision Trees (e.g., CART, BBNs)
required intuition: low
vector count supported: high
feature count supported: medium
class count supported: high
cost to develop: low
schedule to develop: medium
talent to develop: medium
tools to develop: easy to obtain inexpensively
feature types supported: nominal, numeric
feature mix supported: homogeneous, heterogeneous
feature data quality needed: need not fill "gaps"
ground truth types supported: nominal, numeric
relative representational power: high
relative performance: moderately fast
relative weaknesses: many "low support" nodes or rules
relative strengths: can provide insight into the domain
80
The taxonomy can be used to match available paradigms with the characteristics of the data mining problem to be addressed; the selection rules on the following slides (and the code sketch after them) illustrate this.
81
IF
   the ground truth is discrete
   there aren't too many classes
   the class boundaries are simple
   the number of features is medium
   the data are heterogeneous
   no comprehensive, representative data set with GT
   the population is unbalanced by class
   the domain is well-understood by available experts
   conclusion justification is needed
THEN KBES
82
ELSE IF
   the ground truth is numeric
   there is a medium number of classes
   the class boundaries are complex
   the number of features is medium
   the data are numeric
   comprehensive, representative data set tagged with GT
   the population is relatively balanced by class
   the domain is not well-understood by available experts
   conclusion justification is not needed
THEN MLP
83
ELSE IF
   the ground truth is numeric or nominal
   there is a large number of classes
   the class boundaries are very complex
   the number of features is medium
   the data are numeric
   representative data set tagged with GT
   the population is unbalanced by class
   the domain is not well-understood by available experts
   conclusion justification is not needed
THEN RBF
84
ELSE IF
   the ground truth is numeric or nominal
   the number of classes is two
   the class boundaries are very complex
   the number of features is very large
   the data are numeric
   comprehensive, representative data set tagged with GT
   the population is unbalanced by class
   the domain is not well-understood by available experts
   conclusion justification is not needed
THEN SVM
85
ELSE IF
   the ground truth is numeric or nominal
   there is a medium number of classes
   the class boundaries are very complex
   the number of features is medium
   the data are numeric, nominal, or complex
   representative data set tagged with GT
   the population is unbalanced by class
   the domain is not well-understood by available experts
   conclusion justification is needed
THEN Decision Tree (CART, BBN, etc.)
END IF
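The same selection logic, rendered as a deliberately simplified Python sketch (the attribute names and the reduced set of tests are illustrative; the full conditions are on the preceding slides):

  def select_paradigm(p):
      """Suggest a paradigm from a dict of problem characteristics (simplified)."""
      if p["gt"] == "discrete" and p["expertise"] == "high" and p["justification"]:
          return "KBES"
      if p["gt"] == "numeric" and p["boundaries"] == "complex" and p["balanced"]:
          return "MLP"
      if p["boundaries"] == "very complex" and p["classes"] == "many":
          return "RBF"
      if p["classes"] == "two" and p["features"] == "very large":
          return "SVM"
      return "Decision Tree (CART, BBN, etc.)"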
86
Common Reasons Data Mining Projects Fail
87
Mistakes can occur in each major element of data
mining practice!
  • 1. Specification of Enterprise Objectives
  • Defining success
  • 2. Creation of the DM Environment
  • Understanding and Preparing the Data
  • 3. Data Mining Management
  • 4a,b. Descriptive Modeling and Predictive
    Modeling
  • Detecting and Characterizing Patterns
  • Building Models
  • Model Evaluation
  • Model Deployment
  • Model Maintenance

88
1. Specification of Enterprise Objectives
  • Define success
  • Knowledge acquisition interviews (who, what, how)
  • Objective measures of performance (enterprise
    specific)
  • Assessment of enterprise process and data
    environment
  • Specification of data mining objectives

89
Specification Mistakes
  • DM projects require careful management of user
    expectations. Choosing the wrong person as
    customer interface can guarantee user
    disappointment.
  • (GIGOO: Garbage In, GOLD Out!)
  • Since the default assessment of R&D-type efforts is failure, not defining success unambiguously will guarantee failure.

90
2. Creation of the DM Environment
  • Data Warehouse/Data Mart /Database
  • Meta data and schemas
  • Data dependencies
  • Access paths and mechanisms

91
Environmental Mistakes
  • Big data require bigger storage. DM efforts typically work against multiple copies of the data; plan on 2 or 3x.
  • Unwillingness to invest in tools forces data
    miners to consume resources building inferior
    versions of what could have been purchased more
    cheaply.
  • Get labs and network connections set up quickly.

92
Understanding the Data
  • Enterprise data survey
  • Data as a process artifact
  • Temporal Considerations
  • Data Characterization
  • Metadata
  • Collection paths
  • Data Metrics and Quality
  • currency, completeness, correctness, correlation

93
A List of Common Data Problems
  • Conformation (e.g., a dozen ways to say lat/lon)
  • Accessibility (distributed, sensitive)
  • Ground Truth (missing, incorrect)
  • Outliers (detect/process)
  • Gaps (imputation scheme)
  • Time (coverage, periodicity, trends, Nyquist)
  • Consistency (intra/inter record)
  • Class collisions (how to adjudicate)
  • Class population imbalance (balancing)
  • Coding/quantization

94
Data Understanding Mistakes
  • Assuming that no understanding of the domain is
    needed for a successful DM effort
  • Temporal infeasibility: assuming every type of data you find in the warehouse will actually be there when your fielded system needs it.
  • Ignoring the data conformation problem

95
Data Preparation Mistakes
  • Improper handling of missing data, outliers
  • Improper conditioning of data
  • "Trojan-horsing" ground truth into the feature set
  • Having no plan for getting operational access to
    data

96
3. Data Mining Management
  • Data mining skill mix (who are the DM
    practitioners?)
  • Data mining project planning (RAD vs. waterfall)
  • Data mining project management
  • Sample DM project cost/schedule
  • Don't forget Configuration Management!

97
DM Management Mistakes
  • Appointing a domain expert as the technical lead on a DM project virtually guarantees that no new ground will be covered.
  • Inadequate schedule and/or budget poison the
    psychological atmosphere necessary for discovery.
  • Failure to parallelize work
  • Allowing planless tinkering
  • Letting technical people snow you
  • Failure to conduct process audits

98
Configuration Management
  • Nomenclature and naming conventions
  • Documenting the workflow for reproducibility
  • Modeling Process Automation

99
Configuration Management Mistakes
  • Not having a configuration management plan
    (files, directories, nomenclature, audit trail)
    virtually guarantees that any success you have
    will be unreproducible.
  • Allowing each data miner to establish their own
    documentation and auditing procedures guarantees
    that no one will understand what anyone else has
    done.
  • Failure to automate configuration management
    (e.g., putting annotated experiment scripts in a
    log) guarantees that your configuration
    management plan will not work.

100
4a. Descriptive Modeling
  • OLAP (on-line analytical processing)
  • Visualization
  • Unsupervised learning
  • Link/Market Basket Analysis
  • Collaborative Filtering
  • Rule Induction Techniques
  • Logistic Regression

101
4b. Predictive Modeling
  • Paradigms
  • Test Design
  • Meta-Schemes
  • Model Construction
  • Model Evaluation
  • Model Deployment
  • Model Maintenance

102
Paradigms
  • Know what they are
  • Know when to use which
  • Know how to instantiate them
  • Know how to validate them
  • Know how to maintain them

103
Model Construction
  • Architecture (monolithic, hybrid)
  • Formulation of Objective Function
  • Training (e.g., NN)
  • Construction (e.g., KBES)
  • Meta Schemes
  • Bagging
  • Boosting
  • Post-process model calibration
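A minimal sketch of one meta-scheme, bagging (assumes NumPy arrays, integer class labels, and scikit-learn-style base models with fit/predict; this is a generic illustration, not the author's specific scheme):

  import numpy as np

  def bagged_predict(base_models, X_train, y_train, X_test, seed=0):
      """Train each base model on a bootstrap resample, then majority-vote the predictions."""
      rng = np.random.default_rng(seed)
      votes = []
      for model in base_models:
          idx = rng.integers(0, len(X_train), len(X_train))   # sample with replacement
          model.fit(X_train[idx], y_train[idx])
          votes.append(model.predict(X_test))
      votes = np.asarray(votes, dtype=int)
      return np.array([np.bincount(col).argmax() for col in votes.T])   # per-vector majority vote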

104
Modeling Mistakes
  • The "Silver Bullet" Syndrome: relying entirely on a single tool/method
  • Expecting your tools to think for you
  • Overreliance on visualization
  • Using tools that you don't understand
  • Not knowing when to quit ("maybe this is just dirt")
  • Quitting too soon ("I haven't dug deep enough")
  • Picking the wrong modeling paradigm
  • Ignoring population imbalance
  • Overtraining
  • Ignoring feature correlation

105
5. Model Evaluation
  • Blind Testing
  • N-fold Cross-Validation
  • Generalization and Overtraining
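A minimal sketch of N-fold cross-validation (assumes NumPy arrays and a scikit-learn-style estimator with fit and score):

  import numpy as np

  def n_fold_cv(model, X, y, n_folds=5, seed=0):
      """Average blind-test score over N folds; each vector is held out exactly once."""
      rng = np.random.default_rng(seed)
      folds = np.array_split(rng.permutation(len(X)), n_folds)
      scores = []
      for i, test_idx in enumerate(folds):
          train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
          model.fit(X[train_idx], y[train_idx])
          scores.append(model.score(X[test_idx], y[test_idx]))
      return float(np.mean(scores))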

106
Model Evaluation Mistakes
  • Not validating the model
  • Validating the model on the training data
  • Not escrowing a holdback set

107
  6. Model Deployment
  • ASP (applications service provider)
  • API (application program interface)
  • Other
  • plug-ins
  • linked objects
  • file interface, etc.

108
Model Deployment Mistakes
  • Not considering the fielded architecture
  • No user training
  • Not having any operational performance
    requirements (except accuracy)

109
7. Model Maintenance
  • Retraining
  • Poor generalization
  • Heteroscedasticity
  • Non-stationarity
  • Overtraining
  • Changing the problem architecture
  • Adding/subtracting features
  • Modifying ground truth
  • Other

110
Model Maintenance Mistakes
  • Not having a mechanism, method, and criteria for
    tracking performance of the fielded model
  • Not providing a model retraining capability
  • No documentation, no support

111
Published by Digital Press, 2001. ISBN 1-55558-231-1
112
(No Transcript)