Title: Data Mining general ideas
1 Data Mining: general ideas
2 Data Mining: a definition
Art/Science of uncovering non-trivial,
valuable information from a large database
- Emphasis on
  - non-obvious (difficult)
  - useful (cost vs benefit)
  - large (automatic)
- Yet there are no fixed rules, provided that the process is efficient in time, space and human resources.
3 Three big steps
- Data preparation
- Data analysis (neural networks)
- Decision making
4 Data preparation
Extract / Integrate data, Transform, Select, Cleanse
Data warehouse
Takes 50-80% of project time
5 Scrubbing, selecting, cleansing, preprocessing
- Eliminate redundancy
- Eliminate irrelevant data
- Deal with missing data
  - mean, clever substitute, interpolate, ignore, ?
- Correct errors
- Outliers
- Check consistency
- Reserve relevant preprocessing for the data
analysis
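The options listed for missing data (mean, interpolate, ignore) can be sketched in a few lines. This is only an illustration under stated assumptions: missing values are represented as `None`, and the function names `fill_with_mean`, `fill_by_interpolation` and `drop_missing` are hypothetical, not taken from any particular tool.

```python
from statistics import mean


def fill_with_mean(values):
    """Replace missing entries (None) with the mean of the observed ones.

    Raises statistics.StatisticsError if every entry is missing.
    """
    observed = [v for v in values if v is not None]
    m = mean(observed)
    return [m if v is None else v for v in values]


def fill_by_interpolation(values):
    """Linearly interpolate gaps that lie between two observed values.

    Leading and trailing gaps have no neighbour on one side and stay None.
    """
    out = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for a, b in zip(known, known[1:]):
        for i in range(a + 1, b):
            t = (i - a) / (b - a)
            out[i] = values[a] + t * (values[b] - values[a])
    return out


def drop_missing(values):
    """Ignore missing entries altogether."""
    return [v for v in values if v is not None]


readings = [10.0, None, 14.0, None, None, 20.0]
print(fill_with_mean(readings))
print(fill_by_interpolation(readings))
print(drop_missing(readings))
```

Which option is appropriate depends on the data: mean substitution flattens the variance, interpolation assumes the records are ordered (for example in time), and ignoring records shrinks the sample.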
6 Data analysis
- Techniques
  - Decision trees
  - Association rules
  - Polynomial regressions
  - Genetic algorithms
  - Neural networks
- Conceptual tasks
  - Classification
  - Optimization
  - Interpolation
  - Modeling
  - Prediction
- Goals
  - Target marketing
  - Market segmentation
  - Process control
  - Sales forecasting
  - Market laws
Neural networks are a general-purpose analysis
tool based on machine learning from patterns. As
a mathematical tool, they can be viewed as implementing
Bayesian inference. They build non-linear models from
examples.
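To make "non-linear models from examples" concrete, here is a minimal sketch of a small feed-forward network trained by backpropagation on the XOR pattern, which no linear model can reproduce. Everything specific here is an assumption chosen for illustration: one hidden layer of three sigmoid units, a learning rate of 0.5, 5000 passes over the data, and the helper names `forward` and `train`.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(net, x):
    """Return (hidden activations, output) for one input vector x."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(net["w_hidden"], net["b_hidden"])]
    out = sigmoid(sum(w * h for w, h in zip(net["w_out"], hidden)) + net["b_out"])
    return hidden, out


def mean_squared_error(net, examples):
    return sum((forward(net, x)[1] - y) ** 2 for x, y in examples) / len(examples)


def train(examples, n_hidden=3, epochs=5000, rate=0.5, seed=0):
    """Fit the network with per-example gradient descent (backpropagation)."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    n_in = len(examples[0][0])
    net = {
        "w_hidden": [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)],
        "b_hidden": [rng.uniform(-1, 1) for _ in range(n_hidden)],
        "w_out": [rng.uniform(-1, 1) for _ in range(n_hidden)],
        "b_out": rng.uniform(-1, 1),
    }
    initial_error = mean_squared_error(net, examples)
    for _ in range(epochs):
        for x, y in examples:
            hidden, out = forward(net, x)
            # Error signals (squared-error gradient, up to a constant factor),
            # computed before any weight is changed.
            delta_out = (out - y) * out * (1.0 - out)
            delta_hidden = [delta_out * net["w_out"][j] * hidden[j] * (1.0 - hidden[j])
                            for j in range(n_hidden)]
            for j in range(n_hidden):
                net["w_out"][j] -= rate * delta_out * hidden[j]
                net["b_hidden"][j] -= rate * delta_hidden[j]
                for i in range(n_in):
                    net["w_hidden"][j][i] -= rate * delta_hidden[j] * x[i]
            net["b_out"] -= rate * delta_out
    return net, initial_error, mean_squared_error(net, examples)


# The "examples": inputs paired with the desired output (XOR).
xor_examples = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
                ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]
net, error_before, error_after = train(xor_examples)
print("error before training:", error_before)
print("error after training: ", error_after)
```

The model is never told the rule; it only sees the four examples. How well it fits depends on the starting weights and training settings, which is one reason the validation step discussed later matters.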
7 Decision making
(Sometimes underestimated in neural network culture)
Data analysis may seem inscrutable
- The data analyst must deeply understand the problem
- Results must be fairly presented
- Post-processing and merging with subjective factors is often necessary
- Strict validation is necessary
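One common way to make validation strict is to measure a model only on data it did not see while being fitted. The sketch below uses k-fold cross-validation as one such scheme; this is a choice made for illustration, not something the slides prescribe. The two toy models and the names `k_fold_indices` and `cross_validate` are hypothetical, and the data are synthetic.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(xs, ys, fit, predict, k=5, seed=0):
    """Average squared error over k folds, each held out from fitting in turn."""
    errors = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        model = fit(train_x, train_y)
        fold_error = sum((predict(model, xs[i]) - ys[i]) ** 2
                         for i in held_out) / len(held_out)
        errors.append(fold_error)
    return sum(errors) / len(errors)


# Baseline model: always predict the mean of the training targets.
def fit_mean(train_x, train_y):
    return sum(train_y) / len(train_y)


def predict_mean(model, x):
    return model


# Candidate model: least-squares line through the origin.
def fit_slope(train_x, train_y):
    return sum(x * y for x, y in zip(train_x, train_y)) / sum(x * x for x in train_x)


def predict_slope(model, x):
    return model * x


# Synthetic data for the illustration: y = 2x exactly.
xs = [float(i) for i in range(1, 21)]
ys = [2.0 * x for x in xs]
baseline_error = cross_validate(xs, ys, fit_mean, predict_mean)
linear_error = cross_validate(xs, ys, fit_slope, predict_slope)
print("baseline:", baseline_error, "linear:", linear_error)
```

Comparing against a trivial baseline on held-out data gives the decision maker a fair picture of what the model actually adds.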
8 Why is Data Mining not widely used?
- Why do commercial suites (Office, Lotus, StarOffice, ...) not include neural networks (or other advanced tools)?
  - Bold exploitation is often meaningless or non-competitive.
  - Common sense and good training are needed to work out a valuable neural net model: high cost.
- We are living through the early stages of the Information Era.