Data Mining - PowerPoint PPT Presentation

About This Presentation
Slides: 50
Provided by: zhang55
Learn more at: http://www.cs.albany.edu
Transcript and Presenter's Notes

Title: Data Mining


1
Data Mining
  • Edward, Hong Zhang
  • CS Dept, SUNY, Albany
  • CSI 668, March 20, 2001

2
Presentation Outline
  • Motivation
  • Background (KDD Process)
  • What Is Data Mining?
  • Why Data Mining?
  • The Data Mining Process
  • Data Mining Algorithms
  • Data Mining Research Trend
  • Existing Systems for Data Mining
  • Conclusions

3
Motivation: "Necessity is the mother of invention"
  • Data explosion problem
  • Automated data collection tools, the availability of
    increasingly cheap storage devices, and mature
    database technology lead to tremendous amounts of
    data stored in databases, data warehouses, and other
    information repositories.
  • We are drowning in data, but starving for
    knowledge!
  • Data is everywhere
  • Understanding and using data: an imminent task!
  • Solution: knowledge discovery (data warehousing
    and data mining)

4
Evolution of Database Technology
  • 1960s-1970s
  • Data collection, database creation, IMS and
    network DBMS.
  • 1970s-1980s
  • Relational data model, relational DBMS
    implementation.
  • 1980s-1990s
  • RDBMS, advanced data models (extended-relational,
    OO, deductive, etc.) and application-oriented DBMS
    (spatial, scientific, engineering, etc.).
  • 1990s-present
  • Data mining and data warehousing, multimedia
    databases, and Web-based database technology.

5
Background
  • Knowledge Discovery (KD)
  • The process of finding general
    patterns/principles that summarize/explain a set
    of "observations".
  • Knowledge Discovery in Databases (KDD)
  • Very Large DataBases (VLDB) have become the
    industry standard, making it impossible for human
    beings to mine the data "by hand" to look for
    interesting patterns. Automated tools are
    therefore needed to help extract these
    patterns.

6
Background Cont.
  • The knowledge discovery in databases (KDD)
    process consists of 3 steps
  • Data Integration (Data Warehousing)
  • Collecting the target data observations from
    the different data sources, removing noise from
    the observations, and integrating them into an
    appropriate format.
  • Data Mining (will be covered in detail)
  • Applying a concrete algorithm to find useful
    and novel patterns in the integrated data.

7
Background Cont.
  • Pattern Evaluation
  • Interpreting mined patterns, evaluating them
    according to usefulness/interestingness criteria,
    and possibly using visualization tools to aid in
    understanding the patterns graphically.
  • See KDD process graph below

8
Data Mining: KDD process
  • Data mining is the core of the knowledge discovery
    process.
[Figure: the KDD process, from bottom to top: Databases, Data Cleaning and Data Integration, Data Warehouse, Selection, Task-relevant Data, Data Mining, Pattern Evaluation, Knowledge.]
9
What Is Data Mining?
  • Data Mining (knowledge discovery in databases)
  • Extraction of interesting (non-trivial, implicit,
    previously unknown and potentially useful)
    information (knowledge) or patterns from data in
    large databases, data warehouses or other
    information repositories
  • What is not data mining?
  • (Deductive) query processing.
  • Expert systems or Machine Learning/statistical
    programs
  • Online Analytical Processing (OLAP)
  • Software Agents
  • Data Mining: Confluence of Multiple
    Disciplines

10
[Figure: Data Mining at the center, surrounded by the disciplines it draws on: Database, OLAP, High Performance Computing; Visualization; Machine Learning (AI); Pattern Recognition; Statistics Modeling; Information Science.]
11
Why Data Mining? Potential Applications
  • Database analysis and decision support
    systems (DSS)
  • Market analysis and management
  • Target marketing, customer relationship management,
    market basket analysis, cross selling, market
    segmentation.
  • Risk analysis and management
  • Forecasting, customer retention, improved
    underwriting, quality control, competitive
    analysis.
  • Text mining (Text Databases, documents), key
    words search and analysis.
  • DNA sequence analysis and gene expression.

12
Data Mining and Business Intelligence
[Figure: layered diagram with increasing potential to support business decisions toward the top. From top to bottom: Making Decisions (End User); Useful Pattern and Visualization Techniques (Business Analyst); Data Mining and Data Exploration, i.e. Statistical Analysis, Querying and Reporting (Data Analyst); Data Warehouses / Data Marts, with OLAP and MDA (DBA); Data Sources: Paper, Files, Information Providers, Database Systems, OLTP.]
13
Why Data Mining? Potential Applications (Cont.)
  • Internet Web Surf-Aid (Web mining)
  • IBM Surf-Aid applies data mining algorithms to
    Web access logs for market-related pages to
    discover customer preferences and behavior,
    analyze the effectiveness of Web marketing,
    improve Web site organization, etc.
  • Sports
  • IBM Advanced Scout analyzed NBA game statistics
    (shots blocked, assists, and fouls) to gain
    competitive advantage for New York Knicks and
    Miami Heat.

14
The Data Mining Process

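[Figure: the data mining process. Inside the data mining system, a data mining algorithm is trained on historical training data from the data set and produces a model, which is then evaluated. The model is used to score new data (prediction), yielding results and patterns.]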
15
Examples of Discovered Patterns
  • Association rules: find rules between
    different attributes
  • 98% of AOL users also have eBay accounts
  • Classification: classify data based on the
    values in a classifying attribute
  • People aged under 40 with salary > 40,000
    trade on-line
  • Clustering: group data to form new classes
  • Users A and B access similar URLs, so they belong to
    the same group, which has similar user profiles.

16
Are All the Discovered Patterns Interesting?
  • A data mining system/query may generate thousands
    of patterns; not all of them are interesting.
  • Suggested approach: query-based, focused mining
  • Interestingness measures: a pattern is
    interesting if it is
  • easily understood by humans
  • valid on new or test data with some degree of
    certainty.
  • potentially useful
  • novel, or validates some hypothesis that a user
    seeks to confirm

17
How can we Find All and Only Interesting Patterns?
  • Find all the interesting patterns: completeness.
  • Can a data mining system find all the interesting
    patterns?
  • Search only interesting patterns: optimization.
  • Can a data mining system find only the
    interesting patterns?
  • Approaches
  • First generate all the patterns and then filter
    out the uninteresting ones.
  • Generate only the interesting patterns: mining
    query optimization

18
Data Mining Algorithms
  • Four common DM algorithm types
  • The k-Nearest Neighbor Algorithm (KNN)
  • Artificial Neural Network (ANN)
  • Rule Induction
  • Decision Trees

19
The k-Nearest Neighbor Algorithm (KNN)
  • A technique that classifies each record in a
    dataset based on a combination of the classes of
    the k record(s) most similar to it in a
    historical dataset
  • Use entire training database as the model
  • Find the nearest data point and do the same thing
    as you did for that record

[Figure: scatter plot of labeled training points surrounding the query point xq.]
20
The k-Nearest Neighbor Algorithm (KNN) (Cont.)
  • Distance-weighted nearest neighbor algorithm
  • Weight the contribution of each of the k
    neighbors according to its distance to the
    query point xq,
  • giving greater weight to closer neighbors
  • Advantages
  • Calculate the mean value of the k nearest
    neighbors.
  • Robust to noisy data by averaging the k nearest
    neighbors.
  • Very easy to implement.
  • Disadvantages
  • Huge models (the entire training database)
  • More difficult to use in production.
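
The k-nearest-neighbor idea on the last two slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not fix a distance measure or a weighting formula, so Euclidean distance and a 1/distance² weight are assumed here, and the function name `knn_classify` and the sample points are hypothetical.

```python
import math
from collections import defaultdict

def knn_classify(training, query, k=3, weighted=False):
    """Classify `query` from the k nearest records in `training`.

    training: list of (feature_vector, label) pairs; the whole list is the model.
    weighted: if True, each neighbor votes with weight 1 / distance**2
              (an assumed weighting; the slides only ask for "greater weight
              to closer neighbors").
    """
    # Rank every training record by Euclidean distance to the query point.
    by_distance = sorted(
        (math.dist(features, query), label) for features, label in training
    )
    votes = defaultdict(float)
    for distance, label in by_distance[:k]:
        if weighted:
            if distance == 0:
                return label  # an exact match decides the class outright
            votes[label] += 1.0 / distance ** 2
        else:
            votes[label] += 1.0
    return max(votes, key=votes.get)

# Hypothetical training points around a query point xq.
training = [
    ((1.0, 1.0), "+"),
    ((4.0, 4.0), "-"),
    ((4.0, 5.0), "-"),
    ((5.0, 4.0), "-"),
]
xq = (2.0, 2.0)

plain_vote = knn_classify(training, xq, k=3)
weighted_vote = knn_classify(training, xq, k=3, weighted=True)
```

With these points the plain majority vote over three neighbors is decided by the two more distant "-" records, while the distance-weighted vote favors the single close "+" record. Note that every call scans the full training list, which is the "huge model" disadvantage listed above.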

21
Artificial Neural Network Algorithm (ANN)
  • Non-linear predictive models that learn through
    training and loosely resemble biological neural
    networks in structure.
  • Inputs transformed through a network of simple
    processors
  • Processor combines (weighted) inputs and produces
    an output value

22
Artificial neural networks (Cont.)
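[Figure: a single neuron. The input vector x (x0, x1, ..., xn) is multiplied by the weight vector w (w0, w1, ..., wn) and summed (Σ, the weighted sum); the term μk (labeled "Learning Rate" on the slide) is subtracted, and the activation function f produces the output y.]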
  • The n-dimensional input vector x is mapped into
    variable y by means of the scalar product and a
    nonlinear function mapping
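
A minimal sketch of that mapping in Python, assuming a sigmoid as the nonlinear function (the slide does not name one) and treating μk as a value subtracted from the weighted sum, as the minus sign in the diagram suggests. The function name `neuron_output` and the sample numbers are hypothetical.

```python
import math

def neuron_output(x, w, mu_k=0.0):
    """Map input vector x to a scalar y: scalar product, then a nonlinearity."""
    # Weighted sum: the scalar product of the weight and input vectors.
    weighted_sum = sum(wi * xi for wi, xi in zip(w, x))
    # Assumed activation function: the logistic sigmoid.
    return 1.0 / (1.0 + math.exp(-(weighted_sum - mu_k)))

# Hypothetical input and weight vectors.
y = neuron_output(x=[1.0, 2.0], w=[0.5, -0.25])
```

A multilayer network repeats this unit: the outputs of one layer of neurons become the input vector of the next.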

23
Multilayer Perceptron Artificial Neural Network
[Figure: layered network. The input vector xi enters at the input nodes, passes through the hidden nodes, and the output nodes produce the output vector.]
24
Artificial Neural Network evaluation
  • Advantages
  • Prediction accuracy is generally high
  • Robust: still works when training examples contain
    errors
  • Disadvantages
  • Key problem: the neural network model is difficult
    to understand
  • No intuitive understanding of results
  • Long training time
  • Although prediction after training is very quick,
    the training process itself is time-consuming
  • Significant pre-processing of data is often required

25
Rule Induction
  • Rule Induction (rule-based prediction)
  • We first generate a set of rules from a data
    warehouse,
  • then use them to predict values for new data
    items.
  • It works much better on larger (and real) data
    sets, not just on samples of data.
  • Two phases
  • Rule discovery: analyze a historical database
    and generate a set of rules by automatic
    discovery.
  • Prediction: apply the rules to a new data set
    and match the rules to make predictions.

26
Rule Induction Example
Training Set
27
Rule Induction Example (Cont.)
  • 4 attributes
  • Outlook: sunny, overcast, rainy (3 cases)
  • Temperature: hot, mild, cool (3 cases)
  • Humidity: high, normal (2 cases)
  • Windy: true, false (2 cases)
  • 1 outcome class (N: no class, P: have class)
  • In total there are 3 × 3 × 2 × 2 = 36 possible
    combinations, of which 14 are present in the
    set of input examples.
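
The count of possible attribute combinations can be checked by enumerating them. The attribute values below are the ones listed on this slide; the variable names are illustrative only.

```python
from itertools import product

# Attribute domains as listed on the slide.
attributes = {
    "outlook": ["sunny", "overcast", "rainy"],
    "temperature": ["hot", "mild", "cool"],
    "humidity": ["high", "normal"],
    "windy": [True, False],
}

# Every possible record is one element of the Cartesian product of the domains.
combinations = list(product(*attributes.values()))
print(len(combinations))  # 3 * 3 * 2 * 2 = 36
```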

28
Rule Induction Example (Cont.)
  • Some rules induced from the above dataset
  • Classification rules
  • If outlook = sunny and humidity = high then
    class = N
  • If outlook = rainy and windy = true then
    class = N
  • If outlook = overcast then class = P
  • Association rules
  • If temperature = cool then humidity = normal
  • If windy = false and class = N then outlook =
    sunny and humidity = high
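
The prediction phase of rule induction can be sketched as a lookup over the three classification rules listed on this slide. The representation (a list of condition dictionaries) and the names `RULES` and `predict` are assumptions made for illustration; a record that matches no rule simply gets no prediction here.

```python
# Each rule pairs its conditions with the predicted class, as on the slide.
RULES = [
    ({"outlook": "sunny", "humidity": "high"}, "N"),
    ({"outlook": "rainy", "windy": True}, "N"),
    ({"outlook": "overcast"}, "P"),
]

def predict(record, rules=RULES, default=None):
    """Return the class of the first rule whose conditions all match the record."""
    for conditions, label in rules:
        if all(record.get(attr) == value for attr, value in conditions.items()):
            return label
    return default  # no rule fired

# A hypothetical new data item.
new_item = {"outlook": "sunny", "temperature": "hot",
            "humidity": "high", "windy": False}
prediction = predict(new_item)
```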

29
What is a decision tree?
  • A decision tree is a flow-chart-like tree
    structure.
  • Internal node denotes a test on an attribute
  • Branch represents an outcome of the test
  • All tuples in branch have the same value for the
    tested attribute.
  • Leaf node represents class label or class label
    distribution.
  • A series of nested if/then rules
  • Understandable!

30
A Sample Decision Tree
(Same training set as in the rule induction example)
Outlook
  sunny: humidity
    high: N
    normal: P
  overcast: P
  rain: windy
    true: N
    false: P
31
Another Example for DT
  • If x = 1 and y = 0 then class = a
  • If x = 0 and y = 1 then class = a
  • If x = 0 and y = 0 then class = b
  • If x = 1 and y = 1 then class = b
32
Another Example for DT
Credit Analysis
salary < 20000?
  yes: education in graduate?
    yes: accept
    no: reject
  no: accept
33
Decision-Tree Classification Methods
  • The basic top-down decision tree generation
    approach usually consists of two phases
  • Tree construction
  • At start, all the training examples are at the
    root.
  • Partition examples recursively based on selected
    attributes.
  • Tree pruning
  • Aims to remove tree branches that may lead to
    errors when classifying test data (training data
    may contain noise, statistical fluctuations, ...)

34
How to construct a tree?
  • Algorithm
  • greedy algorithm
  • make the locally optimal choice at each step:
    select the best attribute for each tree node.
  • top-down recursive divide-and-conquer manner
  • from root to leaf
  • split node to several branches
  • for each branch, recursively run the algorithm
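
The greedy, top-down construction can be sketched as follows. The slide does not say how the "best" attribute is chosen, so this sketch assumes the attribute that leaves the lowest weighted entropy after the split (an ID3-style choice). The function names and the six training rows are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def best_attribute(rows, attributes, target):
    """Greedy step: pick the attribute whose split leaves the lowest weighted entropy."""
    def split_entropy(attr):
        groups = {}
        for row in rows:
            groups.setdefault(row[attr], []).append(row[target])
        return sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return min(attributes, key=split_entropy)

def build_tree(rows, attributes, target="class"):
    """Return a class label (leaf) or {attribute: {value: subtree}} (internal node)."""
    labels = [row[target] for row in rows]
    # Stop when the node is pure or no attributes remain; use the majority class.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, attributes, target)
    remaining = [a for a in attributes if a != attr]
    # Split the node into one branch per attribute value and recurse on each.
    return {attr: {
        value: build_tree([r for r in rows if r[attr] == value], remaining, target)
        for value in sorted({r[attr] for r in rows})
    }}

# Hypothetical training rows (not the 14-row training set from the slides).
rows = [
    {"outlook": "sunny", "humidity": "high", "class": "N"},
    {"outlook": "sunny", "humidity": "normal", "class": "P"},
    {"outlook": "overcast", "humidity": "high", "class": "P"},
    {"outlook": "overcast", "humidity": "normal", "class": "P"},
    {"outlook": "sunny", "humidity": "high", "class": "N"},
    {"outlook": "overcast", "humidity": "high", "class": "P"},
]
tree = build_tree(rows, ["outlook", "humidity"])
```

On these rows the root split is on outlook, because the overcast branch is already pure; only the sunny branch needs a further split on humidity.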

35
How to prune a tree
  • A decision tree constructed using the training
    data may have too many branches/leaf nodes.
  • Caused by noise, overfitting
  • May result in poor accuracy for unseen samples
  • Prune the tree: merge a subtree into a leaf node.
  • Using a set of data different from the training
    data.
  • At a tree node, if the accuracy without splitting
    is higher than the accuracy with splitting,
    replace the subtree with a leaf node, label it
    using the majority class.
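
The pruning rule above can be sketched as a bottom-up pass over a tree stored as nested dictionaries. This is an illustrative sketch, not a specific published algorithm: the tree representation, the function names, and both data sets are hypothetical, and the majority class is taken from the training rows that reach the node.

```python
from collections import Counter

def classify(tree, row):
    """Follow branches until a leaf (a plain label) is reached."""
    while isinstance(tree, dict):
        attr, branches = next(iter(tree.items()))
        tree = branches.get(row[attr])  # None if the value was never seen
    return tree

def accuracy(tree, rows, target="class"):
    return sum(classify(tree, r) == r[target] for r in rows) / len(rows)

def prune(tree, train, validation, target="class"):
    """Replace a subtree with a majority-class leaf when the leaf is more accurate."""
    if not isinstance(tree, dict) or not train or not validation:
        return tree
    attr, branches = next(iter(tree.items()))
    # Prune the children first, passing down only the rows that reach them.
    pruned = {attr: {
        value: prune(sub,
                     [r for r in train if r[attr] == value],
                     [r for r in validation if r[attr] == value],
                     target)
        for value, sub in branches.items()
    }}
    leaf = Counter(r[target] for r in train).most_common(1)[0][0]
    # Accuracy is measured on data different from the training data.
    if accuracy(leaf, validation, target) > accuracy(pruned, validation, target):
        return leaf
    return pruned

# Hypothetical tree that split on "windy" because of one noisy training row.
overfit_tree = {"outlook": {"overcast": "P",
                            "sunny": {"windy": {True: "N", False: "P"}}}}
train = [
    {"outlook": "sunny", "windy": True, "class": "N"},   # noisy record
    {"outlook": "sunny", "windy": False, "class": "P"},
    {"outlook": "sunny", "windy": False, "class": "P"},
    {"outlook": "overcast", "windy": True, "class": "P"},
]
validation = [
    {"outlook": "sunny", "windy": True, "class": "P"},
    {"outlook": "sunny", "windy": False, "class": "P"},
    {"outlook": "overcast", "windy": True, "class": "P"},
]
pruned_tree = prune(overfit_tree, train, validation)
```

Here the split on windy is merged into a single leaf because the leaf is strictly more accurate on the validation rows; the root split is kept because replacing it would not be strictly better.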

36
How to use a tree?
  • Directly
  • test the attribute value of unknown sample
    against the tree.
  • A path is traced from root to a leaf which holds
    the label
  • Indirectly
  • decision tree is converted to classification
    rules
  • one rule is created for each path from the root
    to a leaf
  • IF-THEN is easier for humans to understand
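
Both ways of using a tree can be sketched on the sample decision tree from the earlier slide. The branch-to-leaf assignment below follows the classification rules from the rule induction example; the nested-dictionary representation and the function names are assumptions made for illustration.

```python
# The sample decision tree: internal nodes are {attribute: {value: subtree}},
# leaves are class labels.
SAMPLE_TREE = {"outlook": {
    "sunny": {"humidity": {"high": "N", "normal": "P"}},
    "overcast": "P",
    "rain": {"windy": {True: "N", False: "P"}},
}}

def classify(tree, sample):
    """Direct use: trace a path from the root to the leaf that holds the label."""
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches[sample[attribute]]
    return tree

def tree_to_rules(tree, conditions=()):
    """Indirect use: create one (conditions, label) rule per root-to-leaf path."""
    if not isinstance(tree, dict):
        return [(list(conditions), tree)]
    attribute, branches = next(iter(tree.items()))
    rules = []
    for value, subtree in branches.items():
        rules.extend(tree_to_rules(subtree, conditions + ((attribute, value),)))
    return rules

# A hypothetical unknown sample.
unknown = {"outlook": "sunny", "humidity": "normal", "windy": False}
label = classify(SAMPLE_TREE, unknown)  # "P"

rules = tree_to_rules(SAMPLE_TREE)
for conditions, outcome in rules:
    tests = " AND ".join(f"{attr} = {value}" for attr, value in conditions)
    print(f"IF {tests} THEN class = {outcome}")
```

The tree has five leaves, so five IF-THEN rules are produced.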

37
Decision tree for a covering algorithm
38
Data Mining Algorithm Summary
  • KNN
  • Quick and easy
  • Models tend to be very large
  • ANN
  • Difficult to interpret
  • Can require significant amounts of time to train
  • Rule Induction
  • Understandable
  • Need to limit calculations
  • Decision Trees
  • Understandable
  • Relatively fast
  • Other DM Technologies
  • Genetic Algorithms
  • Rough sets
  • Bayesian networks
  • Mixture models
  • Many more...

39
Data Mining Research Trend
  • Text mining: text databases and information
    retrieval
  • Multimedia data mining
  • OLAM (OLAP Mining)
  • Web mining (Data Mining and WWW)
  • E-commerce
  • Information retrieval (search)
  • Network management

40
Why Mine the Web?
  • Web: a huge, widely-distributed, highly
    heterogeneous, semi-structured,
    hypertext/hypermedia, interconnected, evolving
    information repository.
  • The Web is a huge collection of documents plus
  • Hyper-link information
  • Access and usage information
  • Enormous wealth of information on Web
  • Financial information (e.g. stock quotes)
  • Book/CD/Video stores (e.g. Amazon)
  • Restaurant information (e.g. Zagats)
  • Car prices (e.g. Carpoint)
  • Lots of data on user access patterns
  • Web logs contain sequence of URLs accessed by
    users

41
Why is Web Mining Different?
  • Huge: the Web is a huge collection of documents,
    and in addition carries
  • Hyper-link information
  • Access and usage information
  • Dynamic: the Web is very dynamic
  • New pages are constantly being generated
  • Unstructured: the complexity of Web pages is far
    greater than that of a text document collection
  • Challenge: develop new Web mining algorithms and
    adapt traditional data mining algorithms to
  • Exploit hyper-links and access patterns
  • Be incremental

42
Types of Web Mining
43
Web Mining Applications
  • E-commerce (Infrastructure)
  • Generate user profiles
  • Targeted advertising
  • Fraud detection
  • Similar image retrieval
  • Information retrieval (Search) on the Web
  • Automated generation of topic hierarchies
  • Web knowledge bases
  • Extraction of schema for XML documents
  • Network Management
  • Performance management
  • Fault management

44
Existing Systems for Data Mining
  • IBM Intelligent Miner.
  • SAS Institute Enterprise Miner.
  • Silicon Graphics MineSet.
  • Integral Solutions Ltd. Clementine.
  • Information Discovery Inc. Data Mining Suite.
  • DBMiner Technology Inc. DBMiner.
  • Rutgers DataMine, GMD Explora, Univ. Munich
    VisDB

45
Microsoft OLE DB for Data Mining
  • Microsoft OLE, OLE DB, OLE DB for OLAP and OLE DB
    for Data Mining
  • OLE DB for DM standardization: July 1999 to March
    2000
  • Microsoft SQL Server 2000 Analysis Manager
  • Analysis Manager consists of OLAP and Data Mining
  • Data mining: two modules (classification/prediction
    and clustering)
  • OLE DB for DM: data mining providers (such as
    association modules and other classification or
    clustering modules)

46
Research Progress for Data Mining in the Last Decade
  • Multi-dimensional data analysis: data warehouse
    and OLAP (on-line analytical processing)
  • Association, correlation, and causality analysis
  • Classification: scalability and new approaches
  • Clustering and outlier analysis
  • Sequential patterns and time-series analysis
  • Text mining, Web mining and Weblog analysis
  • Spatial, multimedia, scientific data analysis
  • Data preprocessing and database compression
  • Data visualization and visual data mining

47
Conclusions
  • Knowledge Discovery in Databases (KDD)
  • Data warehouse: an industry trend
  • A DW stores a huge amount of subject-oriented,
    cleansed, integrated, consolidated, time-related
    data.
  • Data mining: a rich, promising, young field with
    broad applications and many challenging research
    issues. Good science, with a leading position in
    the research community

48
Conclusions (Cont.)
  • Data mining tasks: characterization, association,
    classification, clustering, prediction, sequence
    and pattern analysis, etc.
  • Data mining Algorithms
  • The k-Nearest Neighbor Algorithm (KNN)
  • Artificial Neural Network (ANN)
  • Rule Induction
  • Decision Trees
  • Research progress and trend in Data Mining

49
Future Work
  • Theoretical foundations of data mining.
  • Implementation and new data mining methodologies
  • A set of well-tuned, standard mining operators.
  • Data and knowledge visualization tools.
  • Integration of multiple data mining strategies.
  • Data mining in advanced information systems
  • Spatial, multimedia, Web-mining
  • Data mining applications
  • content browsing, query optimization,
    multi-resolution model, etc.
  • Social issues: a threat to security and privacy.