Title: Data Mining
1Data Mining
- Edward, Hong Zhang
- CS Dept, SUNY, Albany
- CSI 668, March 20, 2001
2Presentation Outline
- Motivation
- Background (KDD Process)
- What Is Data Mining?
- Why Data Mining?
- The Data Mining Process
- Data Mining Algorithms
- Data Mining Research Trend
- Existing Systems for Data Mining
- Conclusions
3Motivation: Necessity is the mother of invention
- Data explosion problem
- Automated data collection tools, increasingly
cheap storage devices, and mature database
technology lead to tremendous amounts of data
stored in databases, data warehouses, and other
information repositories.
- We are drowning in data, but starving for
knowledge!
- Data is everywhere
- Understanding and using data: an imminent task!
- Solution: Knowledge Discovery (data warehousing
and data mining)
4Evolution of Database Technology
- 1960s-1970s
- Data collection, database creation, IMS and
network DBMS.
- 1970s-1980s
- Relational data model, relational DBMS
implementation.
- 1980s-1990s
- RDBMS, advanced data models (extended-relational,
OO, deductive, etc.) and application-oriented DBMS
(spatial, scientific, engineering, etc.).
- 1990s-present
- Data mining and data warehousing, multimedia
databases, and Web-based database technology.
5Background
- Knowledge Discovery (KD)
- The process of finding general
patterns/principles that summarize/explain a set
of "observations".
- Knowledge Discovery in Databases (KDD)
- Very Large DataBases (VLDB) have become the
industry standard, making it impossible for human
beings to mine the data "by hand" to look for
interesting patterns. Automated tools are
therefore needed to help to extract these
patterns.
6Background Cont.
- The knowledge discovery in databases (KDD)
process consists of 3 steps:
- Data Integration (Data Warehousing)
- Collecting the target data observations from
the different data sources, removing noise from
the observations, and integrating them into an
appropriate format.
- Data Mining (will be covered in detail)
- Applying a concrete algorithm to find useful
and novel patterns in the integrated data.
7Background Cont.
- Pattern Evaluation
- Interpreting mined patterns, evaluating them
according to usefulness/interestingness criteria,
and possibly using visualization tools to aid in
understanding the patterns graphically.
- See the KDD process graph below
8Data Mining: KDD process
- Data mining: the core of the knowledge discovery
process.
- [Figure: KDD process flow. Databases -> Data
Cleaning and Data Integration -> Data Warehouse ->
Selection -> Task-relevant Data -> Data Mining ->
Pattern Evaluation -> Knowledge]
9What Is Data Mining?
- Data Mining (knowledge discovery in databases)
- Extraction of interesting (non-trivial, implicit,
previously unknown and potentially useful)
information (knowledge) or patterns from data in
large databases, data warehouses or other
information repositories.
- What is not data mining?
- (Deductive) query processing.
- Expert systems or machine learning/statistical
programs
- Online Analytical Processing (OLAP)
- Software agents
- Data Mining: Confluence of Multiple
Disciplines
10[Figure: Data Mining at the center of the
contributing disciplines: Database, OLAP, High
Performance Computing, Visualization, Machine
Learning (AI), Pattern Recognition, Statistics
Modeling, Information Science]
11Why Data Mining? Potential Applications
- Database analysis and decision support
system (DSS)
- Market analysis and management
- Target marketing, customer relationship management,
market basket analysis, cross-selling, market
segmentation.
- Risk analysis and management
- Forecasting, customer retention, improved
underwriting, quality control, competitive
analysis.
- Text mining (text databases, documents), keyword
search and analysis.
- DNA sequence analysis and gene expression.
12Data Mining and Business Intelligence
- [Figure: pyramid of layers, with increasing
potential to support business decisions toward
the top]
- Layers, top to bottom: Making Decisions; Useful
Pattern (Visualization Techniques); Data Mining;
Data Exploration (Statistical Analysis, Querying
and Reporting); Data Warehouses / Data Marts
(OLAP, MDA); Data Sources (Paper, Files,
Information Providers, Database Systems, OLTP)
- Roles, top to bottom: End User, Business Analyst,
Data Analyst, DBA
13Why Data Mining? Potential Applications (Cont.)
- Internet Web Surf-Aid (Web Mining)
- IBM Surf-Aid applies data mining algorithms to
Web access logs for market-related pages to
discover customer preferences and behavior,
analyze the effectiveness of Web marketing,
improve Web site organization, etc.
- Sports
- IBM Advanced Scout analyzed NBA game statistics
(shots blocked, assists, and fouls) to gain
competitive advantage for New York Knicks and
Miami Heat.
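14The Data Mining Process
- [Figure: the data mining process inside a data
mining system. Training: a historical training
data set is fed to the data mining algorithm,
which produces a model that is then evaluated.
Prediction: the model scores new data to produce
results (patterns).]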
15Examples of Discovered Patterns
- Association rules: find rules between
different attributes
- 98% of AOL users also have eBay accounts
- Classification: classify data based on the
values in a classifying attribute
- People with age < 40 and salary > 40,000
trade online
- Clustering: group data to form new classes
- Users A and B access similar URLs, so they belong
to the same group, which has similar user profiles.
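The association-rule example above can be made concrete with support and confidence, two measures commonly used to score a rule "A -> B". The following is a minimal sketch in Python; the toy records and the helper name `rule_stats` are assumptions for illustration, not data or code from the presentation.

```python
def rule_stats(records, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent.

    Each record is a set of items; antecedent and consequent are sets too.
    """
    n_total = len(records)
    n_antecedent = sum(1 for r in records if antecedent <= r)
    n_both = sum(1 for r in records if (antecedent | consequent) <= r)
    support = n_both / n_total if n_total else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical toy data: the services each user has an account with.
users = [
    {"AOL", "eBay"},
    {"AOL", "eBay"},
    {"AOL"},
    {"eBay"},
]
support, confidence = rule_stats(users, {"AOL"}, {"eBay"})
print(support, confidence)
```

Here support is the fraction of all users holding both accounts, and confidence is the fraction of AOL users who also hold an eBay account.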
16Are All the Discovered Patterns Interesting?
- A data mining system/query may generate thousands
of patterns; not all of them are interesting.
- Suggested approach: query-based, focused mining
- Interestingness measures: a pattern is
interesting if it is
- easily understood by humans
- valid on new or test data with some degree of
certainty
- potentially useful
- novel, or validates some hypothesis that a user
seeks to confirm
17How can we Find All and Only Interesting Patterns?
- Find all the interesting patterns: completeness.
- Can a data mining system find all the interesting
patterns?
- Search for only interesting patterns: optimization.
- Can a data mining system find only the
interesting patterns?
- Approaches
- First generate all the patterns and then filter
out the uninteresting ones.
- Generate only the interesting patterns: mining
query optimization
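The first approach (generate everything, then filter) might look like the sketch below. The threshold values, the dictionary layout of a pattern, and the function name `filter_interesting` are all assumptions chosen for illustration; support and confidence stand in for whatever interestingness measures a real system uses.

```python
def filter_interesting(patterns, min_support=0.1, min_confidence=0.8):
    """Keep only patterns whose support and confidence meet the thresholds."""
    return [
        p
        for p in patterns
        if p["support"] >= min_support and p["confidence"] >= min_confidence
    ]


# Hypothetical mined patterns with precomputed measures.
mined = [
    {"rule": "A -> B", "support": 0.30, "confidence": 0.90},
    {"rule": "C -> D", "support": 0.02, "confidence": 0.95},
    {"rule": "E -> F", "support": 0.40, "confidence": 0.50},
]
interesting = filter_interesting(mined)
print([p["rule"] for p in interesting])
```

The second approach would instead push such thresholds into the mining step itself, so that uninteresting candidates are never generated.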
18Data Mining Algorithms
- Four common DM algorithm types
- The k-Nearest Neighbor Algorithm (KNN)
- Artificial Neural Network (ANN)
- Rule Induction
- Decision Trees
19The k-Nearest Neighbor Algorithm (KNN)
- A technique that classifies each record in a
dataset based on a combination of the classes of
the k record(s) most similar to it in a
historical dataset.
- Uses the entire training database as the model.
- Find the nearest data point and do the same
thing as you did for that record.
- [Figure: query point xq and its nearest neighbors]
20The k-Nearest Neighbor Algorithm (KNN) (Cont.)
- Distance-weighted nearest neighbor algorithm.
- Weight the contribution of each of the k
neighbors according to their distance to the
query point xq,
giving greater weight to closer neighbors
- Advantages
- Calculate the mean values of the k nearest
neighbors.
- Robust to noisy data by averaging the k nearest
neighbors.
- Very easy to implement.
- Disadvantages
- Huge models (the entire training database)
- More difficult to use in production.
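A distance-weighted nearest neighbor classifier can be sketched in a few lines of Python. This is an illustration under stated assumptions: Euclidean distance, a 1/d^2 vote weight, and the toy records are choices made here, not details given in the presentation.

```python
import math
from collections import defaultdict


def knn_classify(training, query, k=3):
    """Classify `query` by a distance-weighted vote of its k nearest neighbors.

    `training` is a list of (feature_vector, label) pairs. Each neighbor's
    vote is weighted by 1 / d^2 (an assumed weighting; others are possible).
    """
    neighbors = sorted(training, key=lambda rec: math.dist(rec[0], query))[:k]
    votes = defaultdict(float)
    for features, label in neighbors:
        d = math.dist(features, query)
        if d == 0.0:
            return label  # exact match: reuse that record's label
        votes[label] += 1.0 / (d * d)
    return max(votes, key=votes.get)


# Hypothetical historical records: (features, class).
history = [
    ((1.0, 1.0), "P"),
    ((1.2, 0.8), "P"),
    ((4.0, 4.2), "N"),
    ((4.5, 3.9), "N"),
]
print(knn_classify(history, (1.1, 1.0), k=3))
```

Note that the "model" is simply the `history` list itself, which is why KNN models grow with the training database.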
21Artificial Neural Network Algorithm (ANN)
- Non-linear predictive models that learn through
training and loosely resemble biological neural
networks in structure.
- Inputs are transformed through a network of simple
processors
- Each processor combines (weighted) inputs and
produces an output value
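22Artificial neural networks (Cont.)
- [Figure: a single neuron. The input vector
x = (x0, x1, ..., xn) and the weight vector
w = (w0, w1, ..., wn) are combined into a weighted
sum (Σ), with the term μk (labeled "Learning Rate")
entering with a minus sign; the activation
function f produces the output y.]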
- The n-dimensional input vector x is mapped into
variable y by means of the scalar product and a
nonlinear function mapping
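The mapping described above (scalar product followed by a nonlinear function) can be sketched for one neuron as follows. The logistic sigmoid as activation function, the treatment of `mu_k` as an offset subtracted from the weighted sum, and the sample numbers are assumptions made for this illustration.

```python
import math


def neuron_output(x, w, mu_k=0.0):
    """Map input vector x to y = f(w . x - mu_k), with f the logistic sigmoid.

    mu_k is treated as an offset subtracted from the weighted sum (an
    assumption about the figure), and the sigmoid is an assumed choice of
    nonlinear activation function.
    """
    weighted_sum = sum(wi * xi for wi, xi in zip(w, x))  # scalar product
    return 1.0 / (1.0 + math.exp(-(weighted_sum - mu_k)))


# Hypothetical input and weight vectors.
y = neuron_output(x=[1.0, 0.5, -1.0], w=[0.2, 0.4, 0.1])
print(y)
```

Training a network amounts to adjusting the weights `w` of many such units so that the outputs match the training examples.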
23Multi-Layer Perceptron of Artificial Neural
Networks
- [Figure: multi-layer perceptron. Input vector xi ->
Input nodes -> Hidden nodes -> Output nodes ->
Output vector]
24Artificial Neural Network evaluation
- Advantages
- prediction accuracy is generally high
- robust: still works when training examples contain
errors
- Disadvantages
- Key problem: difficult to understand
- The neural network model is difficult
to understand
- No intuitive understanding of results
- Long training time
- Although prediction is very quick after training,
the training process itself is time-consuming
- Significant pre-processing of data often required
25Rule Induction
- Rule Induction (rule-based prediction)
- We first generate a set of rules from a data
warehouse,
then use them to predict values for new data
items.
- It works much better on larger (and real) data
sets, not just on samples of data.
- Two phases
- Rule discovery: analyze a historical database
and generate a set of rules by automatic
discovery.
- Prediction: apply the rules to a new data set
and match the rules to make predictions.
26Rule Induction Example
Training Set
27Rule Induction Example (Cont.)
- 4 attributes
- Outlook: sunny, overcast, rainy (3 cases)
- Temperature: hot, mild, cool (3 cases)
- Humidity: high, normal (2 cases)
- Windy: true, false (2 cases)
- 1 outcome class (N: no class, P: have class)
- In total there are 3 × 3 × 2 × 2 = 36 possible
combinations, of which 14 are present in the
set of input examples.
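The count of possible attribute combinations can be checked by enumerating the Cartesian product of the four attribute domains. A small Python sketch, with the dictionary name `domains` chosen here for illustration:

```python
from itertools import product

# Attribute domains as listed on the slide.
domains = {
    "outlook": ["sunny", "overcast", "rainy"],
    "temperature": ["hot", "mild", "cool"],
    "humidity": ["high", "normal"],
    "windy": [True, False],
}
combinations = list(product(*domains.values()))
print(len(combinations))  # 3 * 3 * 2 * 2 = 36
```

Only 14 of these 36 combinations appear in the training set, so any induced rule has to generalize to unseen combinations.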
28Rule Induction Example (Cont.)
- Some rules induced from the above dataset
- Classification rules
- If outlook = sunny and humidity = high then
class = N
- If outlook = rainy and windy = true then
class = N
- If outlook = overcast then class = P
- Association rules
- If temperature = cool then humidity = normal
- If windy = false and class = N then outlook =
sunny and humidity = high
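The prediction phase, matching rules against a new data item, might be sketched as below. The three classification rules are taken from the slide; the first-match strategy, the `default` fallback, and the sample record are assumptions made for this illustration.

```python
# Classification rules from the slide as (conditions, predicted class) pairs.
RULES = [
    ({"outlook": "sunny", "humidity": "high"}, "N"),
    ({"outlook": "rainy", "windy": True}, "N"),
    ({"outlook": "overcast"}, "P"),
]


def predict(record, rules=RULES, default=None):
    """Return the class of the first rule whose conditions all match."""
    for conditions, label in rules:
        if all(record.get(attr) == value for attr, value in conditions.items()):
            return label
    return default  # no rule fires


# Hypothetical new data item.
new_item = {
    "outlook": "sunny",
    "temperature": "hot",
    "humidity": "high",
    "windy": False,
}
print(predict(new_item))
```

A record that matches no rule gets the `default` value, which shows why a rule set induced from only part of the possible combinations may leave some cases undecided.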
29What is a decision tree?
- A decision tree is a flow-chart-like tree
structure.
- An internal node denotes a test on an attribute
- A branch represents an outcome of the test
- All tuples in a branch have the same value for the
tested attribute.
- A leaf node represents a class label or class label
distribution.
- A series of nested if/then rules
- Understandable!
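30A Sample Decision Tree
The same training set as in the rule induction example
- Outlook = sunny: test humidity
- humidity = high: N
- humidity = normal: P
- Outlook = overcast: P
- Outlook = rain: test windy
- windy = true: N
- windy = false: P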
31Another Example for DT
- If x = 1 and y = 0 then class = a
- If x = 0 and y = 1 then class = a
- If x = 0 and y = 0 then class = b
- If x = 1 and y = 1 then class = b
32Another Example for DT
Credit Analysis
- salary < 20000?
- no: accept
- yes: test education in graduate
- education in graduate = yes: accept
- education in graduate = no: reject
33Decision-Tree Classification Methods
- The basic top-down decision tree generation
approach usually consists of two phases
- Tree construction
- At start, all the training examples are at the
root.
- Partition examples recursively based on selected
attributes.
- Tree pruning
- Aims at removing tree branches that may lead to
errors when classifying test data (training data
may contain noise, statistical fluctuations, etc.)
34How to construct a tree?
- Algorithm
- greedy algorithm
- make the locally optimal choice at each step:
select the best attribute for each tree node.
- top-down recursive divide-and-conquer manner
- from root to leaf
- split node to several branches
- for each branch, recursively run the algorithm
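The greedy, top-down construction can be sketched as follows. The presentation does not say how the "best" attribute is chosen; information gain (entropy reduction) is used here as an assumed selection criterion, and the toy rows and the nested-dict tree layout are also illustrative choices.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def best_attribute(rows, attributes, target):
    """Greedy step: pick the attribute with the highest information gain."""
    base = entropy([r[target] for r in rows])

    def gain(attr):
        remainder = 0.0
        for value in {r[attr] for r in rows}:
            subset = [r[target] for r in rows if r[attr] == value]
            remainder += len(subset) / len(rows) * entropy(subset)
        return base - remainder

    return max(attributes, key=gain)


def build_tree(rows, attributes, target):
    """Top-down recursive construction; leaves hold the majority class."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, attributes, target)
    rest = [a for a in attributes if a != attr]
    return {
        attr: {
            value: build_tree([r for r in rows if r[attr] == value], rest, target)
            for value in {r[attr] for r in rows}
        }
    }


# Hypothetical toy rows (not the training set from the slides).
rows = [
    {"outlook": "sunny", "windy": False, "cls": "N"},
    {"outlook": "sunny", "windy": True, "cls": "N"},
    {"outlook": "overcast", "windy": False, "cls": "P"},
    {"outlook": "rainy", "windy": False, "cls": "P"},
    {"outlook": "rainy", "windy": True, "cls": "N"},
]
tree = build_tree(rows, ["outlook", "windy"], "cls")
print(tree)
```

Each recursive call sees only the rows of its own branch, which is the divide-and-conquer step; the choice made at a node is never revisited, which is what makes the algorithm greedy.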
35How to prune a tree
- A decision tree constructed using the training
data may have too many branches/leaf nodes.
- Caused by noise, overfitting
- May result in poor accuracy for unseen samples
- Prune the tree: merge a subtree into a leaf node.
- Use a set of data different from the training
data.
- At a tree node, if the accuracy without splitting
is higher than the accuracy with splitting,
replace the subtree with a leaf node and label it
using the majority class.
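The pruning rule above can be sketched as a bottom-up pass over the tree. The nested-dict tree layout, the function names, and the toy data are assumptions for illustration; the comparison itself (replace a subtree with a majority-class leaf when the leaf is more accurate on separate data) follows the slide.

```python
from collections import Counter


def classify(tree, row, default=None):
    """Follow the tests from the root until a leaf (class label) is reached."""
    while isinstance(tree, dict):
        attr, branches = next(iter(tree.items()))
        if row.get(attr) not in branches:
            return default
        tree = branches[row[attr]]
    return tree


def accuracy(tree, rows, target):
    """Fraction of rows whose predicted class equals the true class."""
    return sum(classify(tree, r) == r[target] for r in rows) / len(rows)


def prune(tree, train_rows, prune_rows, target):
    """Bottom-up pruning using a data set separate from the training data."""
    if not isinstance(tree, dict) or not prune_rows or not train_rows:
        return tree
    attr, branches = next(iter(tree.items()))
    pruned = {
        attr: {
            value: prune(
                subtree,
                [r for r in train_rows if r[attr] == value],
                [r for r in prune_rows if r[attr] == value],
                target,
            )
            for value, subtree in branches.items()
        }
    }
    # Candidate leaf: majority class of the training rows reaching this node.
    leaf = Counter(r[target] for r in train_rows).most_common(1)[0][0]
    if accuracy(leaf, prune_rows, target) > accuracy(pruned, prune_rows, target):
        return leaf
    return pruned


# Hypothetical tree that splits on a noisy attribute, plus toy data.
tree = {"windy": {True: "N", False: "P"}}
train = [
    {"windy": True, "cls": "N"},
    {"windy": False, "cls": "P"},
    {"windy": False, "cls": "P"},
]
held_out = [
    {"windy": True, "cls": "P"},
    {"windy": False, "cls": "P"},
    {"windy": True, "cls": "P"},
]
print(prune(tree, train, held_out, "cls"))
```

In this toy case the split on `windy` only fits noise in the training rows, so the held-out data favors the single leaf "P".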
36How to use a tree?
- Directly
- Test the attribute values of the unknown sample
against the tree.
- A path is traced from the root to a leaf, which
holds the label
- Indirectly
- The decision tree is converted to classification
rules
- One rule is created for each path from the root
to a leaf
- IF-THEN rules are easier for humans to understand
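The indirect use, one rule per root-to-leaf path, can be sketched as a short recursive walk. The nested-dict encoding of the sample outlook/humidity/windy tree and the function name `tree_to_rules` are assumptions made for this illustration.

```python
def tree_to_rules(tree, conditions=()):
    """Yield one (conditions, label) rule per root-to-leaf path."""
    if not isinstance(tree, dict):
        yield list(conditions), tree
        return
    attr, branches = next(iter(tree.items()))
    for value, subtree in branches.items():
        yield from tree_to_rules(subtree, conditions + ((attr, value),))


# The sample decision tree, written as nested dicts (assumed encoding).
sample_tree = {
    "outlook": {
        "sunny": {"humidity": {"high": "N", "normal": "P"}},
        "overcast": "P",
        "rain": {"windy": {True: "N", False: "P"}},
    }
}
rules = list(tree_to_rules(sample_tree))
for conds, label in rules:
    tests = " and ".join(f"{a} = {v}" for a, v in conds)
    print(f"IF {tests} THEN class = {label}")
```

The tree has five leaves, so five IF-THEN rules are produced, each listing the tests along its path.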
37Decision tree for a covering algorithm
38Data Mining Algorithm Summary
- KNN
- Quick and easy
- Models tend to be very large
- ANN
- Difficult to interpret
- Can require significant amounts of time to train
- Rule Induction
- Understandable
- Need to limit calculations
- Decision Trees
- Understandable
- Relatively fast
- Other DM Technologies
- Genetic Algorithms
- Rough sets
- Bayesian networks
- Mixture models
- Many more...
39Data Mining Research Trend
- Text mining: text databases and information
retrieval
- Multimedia data mining
- OLAM (OLAP Mining)
- Web mining (Data Mining and WWW)
- E-commerce
- Information retrieval (search)
- Network management
40Why Mine the Web?
- Web: a huge, widely-distributed, highly
heterogeneous, semi-structured,
hypertext/hypermedia, interconnected, evolving
information repository.
- The Web is a huge collection of documents plus
- Hyper-link information
- Access and usage information
- Enormous wealth of information on Web
- Financial information (e.g. stock quotes)
- Book/CD/Video stores (e.g. Amazon)
- Restaurant information (e.g. Zagats)
- Car prices (e.g. Carpoint)
- Lots of data on user access patterns
- Web logs contain sequence of URLs accessed by
users
41Why is Web Mining Different?
- Huge: the Web is a huge collection of documents
plus
- Hyper-link information
- Access and usage information
- Dynamic: the Web is very dynamic
- New pages are constantly being generated
- Unstructured: the complexity of Web pages is far
greater than that of text document collections
- Challenge: develop new Web mining algorithms and
adapt traditional data mining algorithms to
- Exploit hyper-links and access patterns
- Be incremental
42Types of Web Mining
43Web Mining Applications
- E-commerce (Infrastructure)
- Generate user profiles
- Targeted advertising
- Fraud detection
- Similar image retrieval
- Information retrieval (Search) on the Web
- Automated generation of topic hierarchies
- Web knowledge bases
- Extraction of schema for XML documents
- Network Management
- Performance management
- Fault management
44Existing Systems for Data Mining
- IBM: Intelligent Miner
- SAS Institute: Enterprise Miner
- Silicon Graphics: MineSet
- Integral Solutions Ltd.: Clementine
- Information Discovery Inc.: Data Mining Suite
- DBMiner Technology Inc.: DBMiner
- Rutgers: DataMine; GMD: Explora; Univ. Munich:
VisDB
45Microsoft OLE DB for Data Mining
- Microsoft OLE, OLE DB, OLE DB for OLAP and OLE DB
for Data Mining
- OLE DB for DM standardization: July 1999 to March
2000
- Microsoft SQL Server 2000 Analysis Manager
- Analysis Manager consists of OLAP and Data Mining
- Data mining: two modules (classification/prediction
and clustering)
- OLE DB for DM: data mining providers (such as
association modules and other classification or
clustering modules)
46Research Progress for Data Mining in the Last
Decade
- Multi-dimensional data analysis: data warehouse
and OLAP (on-line analytical processing)
- Association, correlation, and causality analysis
- Classification scalability and new approaches
- Clustering and outlier analysis
- Sequential patterns and time-series analysis
- Text mining, Web mining and Weblog analysis
- Spatial, multimedia, scientific data analysis
- Data preprocessing and database compression
- Data visualization and visual data mining
47Conclusions
- Knowledge Discovery in Databases (KDD)
- Data warehouse: an industry trend
- DW stores a huge amount of subject-oriented,
cleansed, integrated, consolidated, time-related
data.
- Data Mining: a rich, promising, young field with
broad applications and many challenging research
issues. Good science, with a leading position in
the research community
48Conclusions (Cont.)
- Data mining tasks: characterization, association,
classification, clustering, prediction, sequence
and pattern analysis, etc.
- Data mining algorithms
- The k-Nearest Neighbor Algorithm (KNN)
- Artificial Neural Network (ANN)
- Rule Induction
- Decision Trees
- Research progress and trend in Data Mining
49Future Work
- Theoretical foundations of data mining.
- Implementation and new data mining methodologies
- A set of well-tuned, standard mining operators.
- Data and knowledge visualization tools.
- Integration of multiple data mining strategies.
- Data mining in advanced information systems
- Spatial, multimedia, Web-mining
- Data mining applications
- Content browsing, query optimization,
multi-resolution model, etc.
- Social issues: a threat to security and privacy.