Knowledge Discovery in Data and Data Mining (KDD)

Slides: 29
Provided by: eick
Learn more at: http://www2.cs.uh.edu

1
Knowledge Discovery in Data and Data Mining
(KDD)
Let us find something interesting!
  • Definition: KDD is the non-trivial process of
    identifying valid, novel, potentially useful, and
    ultimately understandable patterns in data
    (Fayyad)
  • Frequently, the term data mining is used to refer
    to KDD.
  • Many commercial and experimental tools and tool
    suites are available (see
    http://www.kdnuggets.com/siftware.html)
  • The field is dominated more by industry than by
    research institutions

2
Making Sense of Data: Knowledge Discovery and
Data Mining
  • Introduction to KDD (1 class)
  • Data Warehouses and OLAP (2 classes)
  • Association Rule Mining (1.5 classes)
  • Learning to Classify (1 class)
  • Other Techniques: Clustering, Deviation
    Detection, Sequential Pattern Analysis (0.5
    classes)

3
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining drawn as the meeting point of Database
Technology, Statistics, Machine Learning, Visualization,
Information Science, and Other Disciplines]
4
General KDD Steps
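[Figure: the KDD process as a pipeline, labeled "Data preparation"]
  • Data sources, then Select/preprocess, giving
    Selected/Preprocessed data
  • Selected/Preprocessed data, then Transform, giving
    Transformed data
  • Transformed data, then Data mine, giving
    Extracted information
  • Extracted information, then
    Interpret/Evaluate/Assimilate, giving Knowledge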
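
The four steps can be sketched as a chain of small functions. This is only an illustration: the records, the field names, the age cutoff, and the "pattern" being mined are all made-up assumptions, not part of the original slides.

```python
from statistics import mean

# Hypothetical raw records from a data source; None marks a missing value.
RAW = [
    {"customer": 1, "age": 34, "spend": 120.0},
    {"customer": 2, "age": None, "spend": 80.0},
    {"customer": 3, "age": 51, "spend": 300.0},
    {"customer": 4, "age": 47, "spend": 260.0},
]

def select_preprocess(records):
    """Select/preprocess: keep only records with no missing values."""
    return [r for r in records if None not in r.values()]

def transform(records):
    """Transform: reduce each record to the (age, spend) pair used for mining."""
    return [(r["age"], r["spend"]) for r in records]

def data_mine(rows, age_cutoff=45):
    """Data mine: mean spend below and at/above an (assumed) age cutoff.

    Assumes both groups are non-empty.
    """
    younger = [spend for age, spend in rows if age < age_cutoff]
    older = [spend for age, spend in rows if age >= age_cutoff]
    return {"younger": mean(younger), "older": mean(older)}

def interpret(pattern):
    """Interpret/evaluate: turn the extracted numbers into a readable statement."""
    top = max(pattern, key=pattern.get)
    return f"group '{top}' has the higher mean spend ({pattern[top]:.1f})"

pattern = data_mine(transform(select_preprocess(RAW)))
knowledge = interpret(pattern)
print(knowledge)
```

Each function consumes the output of the previous one, mirroring the arrows in the figure; real systems would replace every stage with something far more elaborate.
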
5
Popular KDD-Tasks
  • Classification (try to learn how to classify)
  • Clustering (finding groups of similar objects)
  • Estimation and Prediction (try to learn a
    function that predicts the value of a
    continuous output variable based on a set of
    input variables)
  • Bayesian and Dependency Networks
  • Deviation and Fraud Detection
  • Text Mining
  • Web Mining
  • Visualization
  • Transformation and Data Cleaning

6
KDD and Classical Data Analysis
  • KDD is less focused than classical data analysis: it
    looks for interesting patterns in data, whereas
    classical data analysis centers on analyzing
    particular relationships in data. The notion of
    interestingness is a key concept in KDD.
    Classical data analysis centers more on
    generating and testing pre-structured hypotheses
    with respect to a given sample set.
  • KDD is more centered on analyzing large volumes
    of data (many fields, many tuples, many tables,
    ...).
  • In a nutshell, the KDD process consists of
    preprocessing (generating a target data set),
    data mining (finding something interesting in the
    data set), and postprocessing (representing the
    found patterns in understandable form and
    evaluating their usefulness in a particular
    domain); classical data analysis is less
    concerned with the preprocessing step.
  • KDD involves collaboration between multiple
    disciplines, namely statistics, AI,
    visualization, and databases.
  • KDD employs non-traditional data analysis
    techniques (neural networks, association rules,
    decision trees, fuzzy logic, evolutionary
    computing, ...).
7
Generating Models as an Example
  • The goal of model generation (sometimes also
    called predictive data mining) is the creation,
    evaluation, and use of models to make predictions
    and to understand the relationships between
    the variables described in a data
    collection. Typical example applications include:
  • generate a model that predicts a student's
    academic performance based on the applicant's data,
    such as the applicant's past grades, test scores,
    past degree, ...
  • generate a model that predicts (based on economic
    data) which stocks to sell, hold, and buy.
  • generate a model to predict whether a patient
    suffers from a particular disease based on the
    patient's medical and other data.
  • Model generation centers on deriving a function
    that can predict a variable using the values of
    other variables: v = f(a1, ..., an)
  • Neural networks, decision trees, naïve Bayesian
    classifiers and networks, regression analysis and
    many other statistical techniques, fuzzy logic
    and neuro-fuzzy systems, and association rules are
    the most popular model generation tools in the
    KDD area.
  • All model generation tools and environments
    employ the basic train-evaluate-predict cycle.
8
Why Do We Need So Many Data Mining / Analysis
Techniques?
  • No generally good technique exists.
  • Different methods make different assumptions with
    respect to the data set to be analyzed (to be
    discussed on the next transparency)
  • Cross fertilization between different methods is
    desirable and frequently helpful in obtaining a
    deeper understanding of the analyzed dataset.

9
Motivation: Necessity is the Mother of
Invention
  • Data explosion problem
  • Automated data collection tools and mature
    database technology lead to tremendous amounts of
    data stored in databases, data warehouses and
    other information repositories
  • We are drowning in data, but starving for
    knowledge!
  • Solution: data warehousing and data mining
  • Data warehousing and on-line analytical
    processing
  • Extraction of interesting knowledge (rules,
    regularities, patterns, constraints) from data
    in large databases

10
Why Data Mining? Potential Applications
  • Database analysis and decision support
  • Market analysis and management
  • target marketing, customer relation management,
    market basket analysis, cross selling, market
    segmentation
  • Risk analysis and management
  • Forecasting, customer retention, improved
    underwriting, quality control, competitive
    analysis
  • Fraud detection and management
  • Other Applications
  • Text mining (news group, email, documents) and
    Web analysis.
  • Intelligent query answering

11
Market Analysis and Management
  • Where are the data sources for analysis?
  • Credit card transactions, loyalty cards, discount
    coupons, customer complaint calls, plus (public)
    lifestyle studies
  • Target marketing
  • Find clusters of model customers who share the
    same characteristics: interest, income level,
    spending habits, etc.
  • Determine customer purchasing patterns over time
  • Conversion of a single to a joint bank account:
    marriage, etc.
  • Cross-market analysis
  • Associations/co-relations between product sales
  • Prediction based on the association information

12
Fraud Detection and Management
  • Applications
  • widely used in health care, retail, credit card
    services, telecommunications (phone card fraud),
    etc.
  • Approach
  • use historical data to build models of fraudulent
    behavior and use data mining to help identify
    similar instances
  • Examples
  • Auto insurance: detect a group of people who
    stage accidents to collect on insurance
  • Money laundering: detect suspicious money
    transactions (US Treasury's Financial Crimes
    Enforcement Network)
  • Medical insurance: detect professional patients,
    rings of doctors, and rings of references

13
Other Applications
  • Sports
  • IBM Advanced Scout analyzed NBA game statistics
    (shots blocked, assists, and fouls) to gain
    competitive advantage for New York Knicks and
    Miami Heat
  • Astronomy
  • JPL and the Palomar Observatory discovered 22
    quasars with the help of data mining
  • Internet Web Surf-Aid
  • IBM Surf-Aid applies data mining algorithms to
    Web access logs for market-related pages to
    discover customer preferences and behavior,
    analyze the effectiveness of Web marketing,
    improve Web site organization, etc.

14
Data Mining and Business Intelligence
[Figure: pyramid of layers; the potential to support business
decisions increases toward the top. Layers, top to bottom:]
  • Making Decisions
  • Data Presentation: Visualization Techniques
  • Data Mining: Information Discovery
  • Data Exploration: Statistical Analysis, Querying
    and Reporting
  • Data Warehouses / Data Marts: OLAP, MDA
  • Data Sources: Paper, Files, Information Providers,
    Database Systems, OLTP
Roles shown alongside the pyramid, top to bottom: End User,
Business Analyst, Data Analyst, DBA
15
Architecture of a Typical Data Mining System
[Figure: layered architecture. Components, top to bottom:]
  • Graphical user interface
  • Pattern evaluation
  • Data mining engine
  • Database or data warehouse server
  • Data cleaning, data integration, and filtering
  • Databases and Data Warehouse
Also shown: Knowledge-base
16
An OLAM Architecture
[Figure: four-layer architecture. Layers, top to bottom:]
  • Layer 4, User Interface: User GUI API (mining query
    in, mining result out)
  • Layer 3, OLAP/OLAM: OLAM Engine, OLAP Engine
  • Data Cube API
  • Layer 2, MDDB: MDDB, Meta Data
  • Database API: Filtering and Integration, Filtering
  • Layer 1, Data Repository: Databases, Data Warehouse
    (data cleaning, data integration)
17
Example: Decision Tree Approach
18
Decision Tree Approach (2)
19
Decision Trees
  • Example
  • Conducted a survey to see which customers were
    interested in a new model car
  • Want to select customers for advertising campaign

training set
20
One Possibility
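[Figure: decision tree; the comparison in the age test was lost
in the transcript]
  • Root test: age (branches Y / N)
  • Second-level tests: city=sf, car=van (each with
    branches Y / N)
  • Leaves: likely, unlikely, likely, unlikely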
21
Another Possibility
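[Figure: decision tree; the comparison in the age test was lost
in the transcript]
  • Root test: car=taurus (branches Y / N)
  • Second-level tests: city=sf, age (each with
    branches Y / N)
  • Leaves: likely, unlikely, likely, unlikely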
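
The two candidate trees can be written down as nested data and evaluated with a short loop. This is a sketch under loud assumptions: the age test is taken to be "age < 45" (the transcript does not preserve the comparison), and the way second-level tests and leaves hang off the Y / N branches is one plausible reading of the figures, not a confirmed one.

```python
# A tree node is (test, yes_subtree, no_subtree); a leaf is a plain string.

def age_test(customer):
    # ASSUMPTION: the original comparison was lost; "age < 45" is a guess.
    return customer["age"] < 45

def city_is_sf(customer):
    return customer["city"] == "sf"

def car_is_van(customer):
    return customer["car"] == "van"

def car_is_taurus(customer):
    return customer["car"] == "taurus"

# ASSUMPTION: branch layout below is one plausible reading of the slides.
tree_one = (age_test,
            (city_is_sf, "likely", "unlikely"),
            (car_is_van, "likely", "unlikely"))

tree_two = (car_is_taurus,
            (city_is_sf, "likely", "unlikely"),
            (age_test, "likely", "unlikely"))

def classify(tree, customer):
    """Walk from the root to a leaf, following the Y or N branch at each test."""
    while not isinstance(tree, str):
        test, yes_branch, no_branch = tree
        tree = yes_branch if test(customer) else no_branch
    return tree

# Hypothetical survey respondents
alice = {"age": 30, "city": "sf", "car": "van"}
bob = {"age": 50, "city": "sf", "car": "taurus"}

for name, person in [("alice", alice), ("bob", bob)]:
    print(name, classify(tree_one, person), classify(tree_two, person))
```

With these made-up respondents the two trees agree on one customer and disagree on the other, which illustrates why the choice among candidate trees matters when selecting customers for a campaign.
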
22
Example: Nearest Neighbor Approach
23
Clustering
[Figure: clusters plotted along three axes: income, education,
age]
24
Another Example: Text
  • Each document is a vector
  • e.g., contains words 1,4,5,...
  • Clusters contain similar documents
  • Useful for understanding, searching documents

[Figure: document clusters labeled sports, international news,
business]
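
A minimal sketch of "each document is a vector": every document becomes a bag of word counts, and cosine similarity measures how close two such vectors are. The three example documents are invented, and cosine similarity is one common choice of measure rather than something the slides prescribe.

```python
import math
from collections import Counter

def to_vector(text):
    """Represent a document as a sparse vector of word counts."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(u[word] * v[word] for word in u.keys() & v.keys())
    norm_u = math.sqrt(sum(count * count for count in u.values()))
    norm_v = math.sqrt(sum(count * count for count in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical documents: two sports items and one business item
sports_a = to_vector("team wins game in final minute")
sports_b = to_vector("coach praises team after game")
business = to_vector("stocks fall as market reacts")

print(cosine(sports_a, sports_b))  # shares "team" and "game", so above zero
print(cosine(sports_a, business))  # no shared words
```

A clustering algorithm run on these vectors would tend to group the two sports documents together because they share vocabulary, which is the sense in which clusters "contain similar documents".
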
25
Issues
  • Given desired number of clusters?
  • Finding best clusters
  • Are clusters semantically meaningful?
  • e.g., yuppies cluster?
  • Using clusters for disk storage
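
The first two issues show up directly in k-means, sketched here as one possible clustering method (the slides do not name a specific algorithm). The number of clusters k must be supplied by the caller, and the result depends on the starting centroids, so the clusters found are not guaranteed to be the best ones. The points and the naive "first k points" initialization are assumptions made for illustration.

```python
import math

def kmeans(points, k, iterations=10):
    """Plain k-means; k is given, centroids start at the first k points."""
    centroids = list(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical 2-D points forming two visibly separate groups
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.2, 4.8), (0.8, 1.1), (4.9, 5.3)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```

Whether the resulting groups are semantically meaningful is a separate question that the algorithm cannot answer; it only minimizes distances.
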

26
Association Rule Mining
[Figure: sales records table (market-basket data) with columns:
transaction id, customer id, products bought]
  • Trend: products p5 and p8 are often bought together
  • Trend: customer 12 likes product p9
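
A trend such as "p5 and p8 are often bought together" is usually quantified with support and confidence. The sketch below counts item pairs over a handful of invented baskets; the transactions and the 0.5 support threshold are assumptions, and a real miner (for example Apriori) would handle larger itemsets and prune the search.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket data: one set of products per transaction
transactions = [
    {"p2", "p5", "p8"},
    {"p5", "p8"},
    {"p1", "p5", "p8"},
    {"p2", "p9"},
    {"p5", "p9"},
]

def count(itemset, baskets):
    """Number of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets)

def support(itemset, baskets):
    """Fraction of baskets that contain the itemset."""
    return count(itemset, baskets) / len(baskets)

def confidence(lhs, rhs, baskets):
    """Of the baskets containing `lhs`, the fraction that also contain `rhs`."""
    return count(lhs | rhs, baskets) / count(lhs, baskets)

# Count every pair of products that occurs together in some basket.
pair_counts = Counter(
    pair for basket in transactions for pair in combinations(sorted(basket), 2)
)
min_support = 0.5  # assumed threshold
frequent_pairs = {
    pair: n / len(transactions)
    for pair, n in pair_counts.items()
    if n / len(transactions) >= min_support
}

print(frequent_pairs)
print(confidence({"p5"}, {"p8"}, transactions))
```

Support says how common the pair is overall, while confidence says how reliable the rule "p5 implies p8" is among baskets that contain p5.
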

27
Summary: Data Mining
  • Data mining: discovering interesting patterns
    from large amounts of data
  • A natural evolution of database technology, in
    great demand, with wide applications
  • A KDD process includes data cleaning, data
    integration, data selection, transformation, data
    mining, pattern evaluation, and knowledge
    presentation
  • Mining can be performed in a variety of
    information repositories
  • Data mining functionalities: characterization,
    discrimination, association, classification,
    clustering, outlier and trend analysis, etc.
  • Classification of data mining systems
  • Major issues in data mining

28
Where to Find References?
  • Data mining and KDD (SIGKDD member CD-ROM)
  • Conference proceedings: KDD, and others, such as
    PKDD, PAKDD, etc.
  • Journal: Data Mining and Knowledge Discovery
  • Database field (SIGMOD member CD-ROM)
  • Conference proceedings: ACM-SIGMOD, ACM-PODS,
    VLDB, ICDE, EDBT, DASFAA
  • Journals: ACM-TODS, J. ACM, IEEE-TKDE, JIIS, etc.
  • AI and Machine Learning
  • Conference proceedings: Machine Learning, AAAI,
    IJCAI, etc.
  • Journals: Machine Learning, Artificial
    Intelligence, etc.
  • Statistics
  • Conference proceedings: Joint Stat. Meeting, etc.
  • Journals: Annals of Statistics, etc.
  • Visualization
  • Conference proceedings: CHI, etc.
  • Journals: IEEE Trans. Visualization and Computer
    Graphics, etc.