Title: Introduction to Data Mining
1 Introduction to Data Mining
- Marko Grobelnik
- Institut Jozef Stefan
2 Outline
- Motivation and definition
- What are typical applications?
- How do we build solutions?
- Methods and algorithms
- Tools and standards
- Conclusion
3 Motivation: Necessity Is the Mother of Invention
- Data explosion problem
- Automated data collection tools and mature
database technology lead to tremendous amounts of
data stored in databases, data warehouses and
other information repositories
- We are drowning in data, but starving for
knowledge!
4 Data pyramid
Wisdom = Knowledge + experience
Knowledge = Information + rules
Information = Data + context
Data
5 What Is Data Mining?
- Data mining (knowledge discovery in databases,
KDD; business intelligence)
- Extraction of interesting (non-trivial,
implicit, previously unknown and potentially
useful) information from data in large databases
- Tell me something interesting about the data.
- Describe the data.
6 Potential Applications
- Database analysis and decision support
- Market analysis and management
- Risk analysis and management
- Fraud detection and management
- Text analysis - Text Mining
- Web analysis - Web Mining
- Intelligent query answering
7 Market Analysis and Management
- Where are the data sources for analysis?
- Credit card transactions, loyalty cards, discount
coupons, customer complaint calls, plus (public)
lifestyle studies.
- Target marketing
- Find clusters of model customers who share the
same characteristics: interest, income level,
spending habits, etc.
- Determine customer purchasing patterns over time
- Conversion of a single to a joint bank account:
marriage, etc.
8 Analysis and Risk Management
- Finance planning and asset evaluation
- cash flow analysis and prediction
- time series analysis (trend analysis, etc.)
- Resource planning
- summarize and compare the resources and spending
- Competition
- Monitor competitors and market directions
- Set pricing strategy in a highly competitive
market
9 Fraud Detection and Management
- Use historical data to build models of fraudulent
behavior and use data mining to help identify
similar instances
- Example applications
- Auto insurance: detect a group of people who
stage accidents to collect on insurance
- Money laundering: detect suspicious money
transactions
- Telephone fraud: detect suspicious
patterns (generate a call model: destination,
time, duration)
10 Other Areas of Application
- Sports
- Analysis of games in the NBA (e.g., detect the
opponent's strategy)
- Astronomy
- Discovery and classification of new objects
- Internet
- Analysis of Web access logs, discovery of user
behavior patterns, analyzing effectiveness of Web
marketing, improving Web site organization
- Text
- News analysis, medical record analysis, automatic
email sorting and filtering, automatic document
categorization
11 Data mining: intersection of multiple
disciplines
- Database systems, data warehouse and OLAP
- Statistics
- Machine learning
- Visualization
- Information science
- High performance computing
- Other disciplines
- Neural networks, mathematical modeling,
information retrieval, pattern recognition, ...
12 From data to knowledge
- Data mining: the core of the knowledge discovery
process.
[Figure: the knowledge discovery process, from
bottom to top: Databases -> Data Cleaning and Data
Integration -> Data Warehouse -> Selection ->
Task-relevant Data -> Data Mining -> Pattern
Evaluation -> Knowledge]
13 Main steps of KDD
- Learning the application domain
- relevant prior knowledge and goals of application
- Data cleaning and preprocessing (may take 60% of
effort!)
- Creating a target data set: data selection
- find useful features, generate new features, map
feature values, discretization of values
- Choosing data mining tools/algorithms
- summarization, classification, regression,
association, clustering.
- Data mining: search for patterns of interest
- Interpretation and analysis of results.
- visualization, transformation, removing redundant
patterns, etc.
- Use of discovered knowledge.
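The "discretization of values" part of preprocessing can be illustrated with a small sketch. The slides do not say which discretization scheme is meant; equal-width binning is assumed here as one simple option, and the function name `equal_width_bins` and the sample ages are made up for illustration.

```python
def equal_width_bins(values, n_bins):
    """Map each numeric value to a bin index in the range 0 .. n_bins - 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so they all fall into one bin.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land exactly on the upper edge,
    # so it is clamped into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical customer ages, discretized into three age groups.
ages = [18, 22, 25, 31, 40, 47, 58, 66]
age_bins = equal_width_bins(ages, 3)
print(age_bins)
```

Equal-width bins are easy to explain but sensitive to outliers; equal-frequency bins are a common alternative when the values are skewed.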
14 Data Mining and Business Intelligence
[Figure: layered pyramid; the potential to support
business decisions increases toward the top]
Layers, top to bottom:
- Making Decisions
- Data Presentation: Visualization Techniques
- Data Mining: Information Discovery
- Data Exploration: Statistical Analysis, Querying
and Reporting
- Data Warehouses / Data Marts: OLAP, MDA
- Data Sources: Paper, Files, Information
Providers, Database Systems, OLTP
Roles, top to bottom: End User, Business Analyst,
Data Analyst, DBA
15 Mining the data: what kind of data?
- Relational databases
- Data warehouses
- Transactional databases
- Advanced DB systems and information repositories:
object-oriented and object-relational databases,
spatial databases, time-series data and temporal
data, text databases and multimedia databases,
heterogeneous and legacy databases, WWW
16 Data mining algorithms (I)
- Association
- finding rules like: if the customer bought item
A, then in X% of transactions she/he also bought
item B. This holds for Y% of all transactions
- Classification and Prediction
- classify data based on the values in a
classifying attribute, e.g., classify countries
based on climate, or classify cars based on gas
mileage
- predict some unknown or missing attribute values
based on other information
17 Data mining algorithms (II)
- Clustering
- group data to form new classes, e.g., find
groups of customers with similar behavior
- Time-series analysis
- trend and deviation analysis: find and
characterize evolution trends, sequential
patterns, similar sequences, and deviation data,
e.g., stock analysis.
- similarity-based pattern-directed analysis: find
and characterize user-specified patterns in
large databases.
- cyclicity/periodicity analysis: find
segment-wise or total cycles or periodic
behaviors in time-related data.
- Other pattern-directed or statistical analysis
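Trend and deviation analysis can be sketched in a few lines. This is only one simple way to do it, assumed for illustration: the trend at each point is the mean of the preceding `window` values, and a point counts as a deviation when it differs from that trend by more than a fixed `threshold`. The names `trailing_mean` and `find_deviations` and the price series are hypothetical.

```python
def trailing_mean(series, window):
    """Mean of the `window` values before each position (None where unavailable)."""
    return [
        None if i < window else sum(series[i - window:i]) / window
        for i in range(len(series))
    ]


def find_deviations(series, window, threshold):
    """Indices whose value differs from the trailing mean by more than threshold."""
    trend = trailing_mean(series, window)
    return [
        i
        for i, (value, expected) in enumerate(zip(series, trend))
        if expected is not None and abs(value - expected) > threshold
    ]


# Made-up daily prices with one obvious spike at index 5.
prices = [10, 11, 12, 11, 12, 20, 12, 13]
outliers = find_deviations(prices, window=3, threshold=5)
print(outliers)
```

A fixed threshold is the crudest choice; scaling it by the recent standard deviation is a common refinement when volatility changes over time.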
18 Association rules
- Finding associations or correlations among a set
of items
- Applications
- basket data analysis, cross-marketing
- Example
- buying beer and chips -> ketchup [0.5%, 60%]
- rule form: LHS -> RHS [support, confidence]
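Support and confidence of a rule LHS -> RHS can be computed directly from a list of transactions. The sketch below uses the usual definitions: support is the fraction of all transactions that contain both sides of the rule, and confidence is the fraction of transactions containing the LHS that also contain the RHS. The baskets are made-up illustration data, not the data behind the figures on the slide.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(transactions, lhs, rhs):
    """Of the transactions containing `lhs`, the fraction that also contain `rhs`."""
    lhs_support = support(transactions, lhs)
    if lhs_support == 0:
        return 0.0  # the rule never applies
    return support(transactions, set(lhs) | set(rhs)) / lhs_support


# Hypothetical shopping baskets.
baskets = [
    {"beer", "chips", "ketchup"},
    {"beer", "chips"},
    {"beer", "chips", "ketchup"},
    {"bread", "milk"},
    {"chips", "ketchup"},
]

rule_support = support(baskets, {"beer", "chips", "ketchup"})
rule_confidence = confidence(baskets, {"beer", "chips"}, {"ketchup"})
print(f"beer and chips -> ketchup [{rule_support:.0%}, {rule_confidence:.0%}]")
```

Real association-rule miners such as Apriori avoid scanning every candidate itemset this way; the sketch only shows what the two numbers in the rule form mean.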
19 Classification
- Finding rules that describe given groups of
objects
- Applications: credit approval, target marketing,
medical diagnosis, treatment effectiveness
analysis, ...
- Example: based on the past symptoms and diagnoses
of patients, generate a model describing the
influence of symptoms on disease, to be used for
classification of future test data and better
understanding of each class
- Methods: decision trees (e.g., ID3, C5),
statistics, neural networks, ...
20 Classification using decision trees
- A decision tree
- Top-down decision tree generation algorithm, at
each step:
- partition examples based on the selected
attribute value
- select attribute favoring the partitioning which
makes the majority of examples belong to a single
class
[Figure: decision tree with root attribute
"outlook" (branches: sunny, overcast, rain), inner
nodes "humidity" and "windy", and leaves labeled P
or N]
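The attribute selection step can be sketched as follows. The slide does not name a selection measure; information gain (the entropy-based criterion used by ID3) is assumed here as one common way to favor partitions whose subsets mostly belong to a single class. The six weather examples are made up for illustration and use the P/N class labels from the figure.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Reduction in class entropy obtained by partitioning rows on `attribute`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


def best_attribute(rows, attributes, target):
    """Attribute whose partition gives the purest subsets (highest gain)."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))


# Hypothetical training examples; "play" is the class (P or N).
rows = [
    {"outlook": "sunny", "windy": "no", "play": "N"},
    {"outlook": "sunny", "windy": "yes", "play": "N"},
    {"outlook": "overcast", "windy": "no", "play": "P"},
    {"outlook": "overcast", "windy": "yes", "play": "P"},
    {"outlook": "rain", "windy": "no", "play": "P"},
    {"outlook": "rain", "windy": "yes", "play": "N"},
]

root = best_attribute(rows, ["outlook", "windy"], "play")
print(root)
```

A full tree builder would apply `best_attribute` recursively to each partition until the subsets are pure or no attributes remain.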
21 Classification methods
- Decision trees and decision rules
- given a training set of labeled data
- tree pruning used for noise handling and avoiding
data overfitting
- Bayesian classification
- Naïve Bayesian classification
- Bayesian belief networks
- Neural network approach
- multi-layer networks and back-propagation
- Genetic algorithms
- genetic operators (mutation, cross-over) and
fitness function selection
22 Clustering methods
- partitioning a set of data into a set of classes,
called clusters, such that the members of each
class share some interesting common
properties.
- high-quality clusters: the intra-class
similarity is high and the inter-class similarity
is low
- The distance measure is important
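The quality criterion can be checked numerically for a given clustering. The sketch below assumes Euclidean distance as the distance measure (the slides leave the choice open) and compares the mean distance between points in the same cluster with the mean distance between points in different clusters. The function name `mean_pairwise_distances` and the points are hypothetical, and the function assumes at least one same-cluster pair and one cross-cluster pair.

```python
from itertools import combinations
from math import dist


def mean_pairwise_distances(points, labels):
    """Return (mean intra-cluster distance, mean inter-cluster distance)."""
    intra, inter = [], []
    for (p, label_p), (q, label_q) in combinations(zip(points, labels), 2):
        # Same label: the pair measures cohesion; otherwise separation.
        (intra if label_p == label_q else inter).append(dist(p, q))
    return sum(intra) / len(intra), sum(inter) / len(inter)


# Two made-up groups of 2-D points and two candidate clusterings.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_labels = [0, 0, 0, 1, 1, 1]
bad_labels = [0, 1, 0, 1, 0, 1]

good_intra, good_inter = mean_pairwise_distances(points, good_labels)
bad_intra, bad_inter = mean_pairwise_distances(points, bad_labels)
print(good_intra, good_inter)
print(bad_intra, bad_inter)
```

Small intra-cluster distance together with large inter-cluster distance corresponds to high intra-class similarity and low inter-class similarity; a different distance measure can change which clustering looks better.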
23 Data-Mining tools
- Main producers of Data-Mining software
- IBM Intelligent Miner, extender for DB2
- SAS Enterprise Miner
- SPSS Clementine
- Microsoft Analysis Server (part of SQL Server
2000)
- many more smaller producers
24 Data Mining standards
- PMML (Predictive Model Markup Language)
- XML-like language for saving and sharing models
(most widely accepted standard)
- CRISP-DM
- standardized methodology for building Data Mining
applications
- OLE DB for Data Mining
- Microsoft's standard for developing OLE DB/COM
components for extending Analysis Server with new
Data Mining functionality (uses a customized SQL
language)
- IBM and Oracle prepared standard extensions to
the SQL language to support Data Mining
functionality
25 Conclusion
- Data Mining is an area under rapid development
- Who needs Data Mining, and why?
- (almost) everybody who has data
- to get something more out of the data
- More information
- http://www.kdnuggets.com/