Title: Promising Newer Technologies to Cope with the Information Flood
1Promising Newer Technologies to Cope with the
Information Flood
- Knowledge Discovery and Data Mining (KDD)
- Agent-based Technologies
- Ontologies and Knowledge Brokering
- Non-traditional data analysis techniques
Model Generation As an Example To Explain
/ Discuss Technologies
2Knowledge Discovery in Data and Data Mining
(KDD)
Let us find something interesting!
- Definition: KDD is the non-trivial process of
identifying valid, novel, potentially useful, and
ultimately understandable patterns in data
(Fayyad)
- Frequently, the term data mining is used to refer
to KDD.
- Many commercial and experimental tools and tool
suites are available (see
http://www.kdnuggets.com/siftware.html)
- The field is dominated more by industry than by
research institutions
3Making Sense of Data ---Knowledge Discovery and
Data Mining
- 2005 Lectures
- Introduction to KDD
- Similarity Assessment
- Clustering
- Classification (very very brief)
- Association Rule Mining
- Spatial Databases and Spatial Data Mining
- Data Warehouses and OLAP
4Data Mining: Confluence of Multiple Disciplines
Database Technology
Statistics
Data Mining
Machine Learning
Visualization
Information Science
Other Disciplines
5General KDD Steps
Data sources
-> Select/preprocess -> Selected/Preprocessed data
-> Transform -> Transformed data
-> Data mine -> Extracted information
-> Interpret/Evaluate/Assimilate -> Knowledge
Data preparation covers the steps before data mining
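The step sequence on this slide can be sketched as a chain of small functions. This is only an illustration: the records, the function names, and the `min_share` threshold are all made up here and are not part of the slides.

```python
from collections import Counter

# Hypothetical raw records: (customer, product); None marks a missing value.
raw = [("ann", "bread"), ("bob", None), ("ann", "milk"),
       ("cat", "bread"), ("bob", "bread"), ("cat", "milk")]

def select_preprocess(records):
    """Select/preprocess: drop records with missing fields."""
    return [r for r in records if None not in r]

def transform(records):
    """Transform: group products into one basket per customer."""
    baskets = {}
    for customer, product in records:
        baskets.setdefault(customer, set()).add(product)
    return baskets

def data_mine(baskets):
    """Data mine: count how many baskets contain each product."""
    return Counter(p for items in baskets.values() for p in items)

def interpret(counts, n_baskets, min_share=0.7):
    """Interpret/evaluate: keep products found in at least min_share of baskets."""
    return sorted(p for p, c in counts.items() if c / n_baskets >= min_share)

baskets = transform(select_preprocess(raw))
knowledge = interpret(data_mine(baskets), len(baskets))
print(knowledge)
```

Each function consumes the output of the previous step, which mirrors the chain from data sources to knowledge; a real KDD process would usually iterate over these steps rather than run them once.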
6Popular KDD-Tasks
- Classification (try to learn how to classify)
- Clustering (finding groups of similar objects)
- Estimation and Prediction (try to learn a
function that predicts the value of a
continuous output variable based on a set of
input variables)
- Bayesian and Dependency Networks
- Deviation and Fraud Detection
- Text Mining
- Web Mining
- Visualization
- Transformation and Data Cleaning
7KDD and Classical Data Analysis
- KDD is less focused than classical data analysis: it
looks for interesting patterns in data, whereas classical
data analysis centers on analyzing particular
relationships in data. The notion of
interestingness is a key concept in KDD.
Classical data analysis centers more on
generating and testing pre-structured hypotheses
with respect to a given sample set.
- KDD is more centered on analyzing large volumes
of data (many fields, many tuples, many tables,
...).
- In a nutshell, the KDD process consists of
preprocessing (generating a target data set),
data mining (finding something interesting in the
data set), and postprocessing (representing the
found patterns in understandable form and
evaluating their usefulness in a particular
domain); classical data analysis is less
concerned with the preprocessing step.
- KDD involves collaboration between multiple
disciplines, namely statistics, AI,
visualization, and databases.
- KDD employs non-traditional data analysis
techniques (neural networks, association rules,
decision trees, fuzzy logic, evolutionary
computing, ...).
8Generating Models as an Example
- The goal of model generation (sometimes also
called predictive data mining) is the creation,
evaluation, and use of models to make predictions
and to understand the relationships between
the variables that are described in a data
collection. Typical example applications include:
- generate a model that predicts a student's
academic performance based on the applicant's data,
such as the applicant's past grades, test scores,
past degree, ...
- generate a model that predicts (based on economic
data) which stocks to sell, hold, and buy.
- generate a model that predicts whether a patient suffers
from a particular disease based on the patient's
medical and other data.
- Model generation centers on deriving a function
that can predict a variable using the values of
other variables: v = f(a1, ..., an)
- Neural networks, decision trees, naïve Bayesian
classifiers and networks, regression analysis and
many other statistical techniques, fuzzy logic
and neuro-fuzzy systems, and association rules are
the most popular model generation tools in the
KDD area.
- All model generation tools and environments
employ the basic train-evaluate-predict cycle.
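A minimal sketch of the train-evaluate-predict cycle, using a 1-nearest-neighbor classifier as a stand-in for any of the model generation tools listed above. The data values, the class labels, and the 6/2 split are invented for illustration only.

```python
import math

# Hypothetical data set: (a1, a2) -> class label v; all values are made up.
data = [((1.0, 1.2), "low"), ((0.8, 1.0), "low"), ((1.1, 0.9), "low"),
        ((3.0, 3.2), "high"), ((3.1, 2.9), "high"), ((2.9, 3.0), "high"),
        ((1.2, 1.1), "low"), ((3.2, 3.1), "high")]

train, test = data[:6], data[6:]   # hold out the last two examples

def predict(model, x):
    """1-nearest-neighbor: return the label of the closest stored example."""
    _, label = min(model, key=lambda ex: math.dist(ex[0], x))
    return label

# Train: for nearest neighbor, the "model" is simply the stored training set.
model = list(train)

# Evaluate: accuracy on examples that were not used for training.
accuracy = sum(predict(model, x) == v for x, v in test) / len(test)

# Predict: apply the model f to a new object with unknown v.
new_label = predict(model, (2.8, 3.3))
print(accuracy, new_label)
```

Here `predict` plays the role of f in v = f(a1, ..., an); other techniques differ in how the training step builds the model, but the evaluation on held-out data and the final prediction step look the same.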
9Why Do We Need so many Data Mining / Analysis
Techniques?
- No generally good technique exists.
- Different methods make different assumptions with
respect to the data set to be analyzed (to be
discussed on a later transparency)
- Cross-fertilization between different methods is
desirable and frequently helpful in obtaining a
deeper understanding of the analyzed dataset.
10Motivation: "Necessity is the Mother of
Invention"
- Data explosion problem
- Automated data collection tools and mature
database technology lead to tremendous amounts of
data stored in databases, data warehouses and
other information repositories
- We are drowning in data, but starving for
knowledge!
- Solution: data warehousing and data mining
- Data warehousing and on-line analytical
processing
- Extraction of interesting knowledge (rules,
regularities, patterns, constraints) from data
in large databases
11Why Data Mining? Potential Applications
- Database analysis and decision support
- Market analysis and management
- Target marketing, customer relation management,
market basket analysis, cross selling, market
segmentation
- Risk analysis and management
- Forecasting, customer retention, improved
underwriting, quality control, competitive
analysis
- Fraud detection and management
- Other Applications
- Text mining (news group, email, documents) and
Web analysis.
- Intelligent query answering
12Market Analysis and Management
- Where are the data sources for analysis?
- Credit card transactions, loyalty cards, discount
coupons, customer complaint calls, plus (public)
lifestyle studies
- Target marketing
- Find clusters of "model" customers who share the
same characteristics: interest, income level,
spending habits, etc.
- Determine customer purchasing patterns over time
- Conversion of a single to a joint bank account:
marriage, etc.
- Cross-market analysis
- Associations/co-relations between product sales
- Prediction based on the association information
13Fraud Detection and Management
- Applications
- Widely used in health care, retail, credit card
services, telecommunications (phone card fraud),
etc.
- Approach
- Use historical data to build models of fraudulent
behavior and use data mining to help identify
similar instances
- Examples
- Auto insurance: detect a group of people who
stage accidents to collect on insurance
- Money laundering: detect suspicious money
transactions (US Treasury's Financial Crimes
Enforcement Network)
- Medical insurance: detect professional patients
and rings of doctors and rings of references
14Other Applications
- Sports
- IBM Advanced Scout analyzed NBA game statistics
(shots blocked, assists, and fouls) to gain
competitive advantage for the New York Knicks and
Miami Heat
- Astronomy
- JPL and the Palomar Observatory discovered 22
quasars with the help of data mining
- Internet Web Surf-Aid
- IBM Surf-Aid applies data mining algorithms to
Web access logs for market-related pages to
discover customer preference and behavior pages,
analyze the effectiveness of Web marketing,
improve Web site organization, etc.
15Data Mining and Business Intelligence
[Figure: pyramid of layers; the potential to support business decisions
increases toward the top]
- Making Decisions
- Data Presentation: Visualization Techniques
- Data Mining: Information Discovery
- Data Exploration: Statistical Analysis, Querying and Reporting
- Data Warehouses / Data Marts: OLAP, MDA
- Data Sources: Paper, Files, Information Providers, Database
Systems, OLTP
Roles shown beside the layers, top to bottom: End User, Business Analyst,
Data Analyst, DBA
16Architecture of a Typical Data Mining System
Graphical user interface
Pattern evaluation
Data mining engine
Knowledge-base
Database or data warehouse server
Filtering
Data cleaning and data integration
Data Warehouse
Databases
17Example: Decision Tree Approach
18Decision Tree Approach (2)
19Decision Trees
- Example
- Conducted a survey to see which customers were
interested in a new car model
- Want to select customers for an advertising campaign
[Figure: training set]
20One Possibility
[Figure: decision tree that tests age at the root, then city (sf) and
car (van); each test has Y/N branches and the leaves are labeled
likely or unlikely]
21Another Possibility
[Figure: decision tree that tests car (taurus) at the root, then city (sf)
and age; each test has Y/N branches and the leaves are labeled
likely or unlikely]
22Example: Nearest Neighbor Approach
23Clustering
[Figure: clusters of points in a space with axes income, education, and age]
24Another Example: Text
- Each document is a vector
- e.g., contains words 1,4,5,...
- Clusters contain similar documents
- Useful for understanding, searching documents
[Figure: document clusters labeled sports, international news, and business]
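To make "each document is a vector" concrete, the sketch below builds word-count vectors and compares them with cosine similarity. Cosine similarity is one common choice and is an assumption here, as are the three toy documents; the slides do not say which similarity measure is used.

```python
import math
from collections import Counter

def to_vector(text):
    """Represent a document as a bag-of-words vector (word -> count)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical three-document collection.
docs = {
    "d1": "team wins the final game",
    "d2": "the team lost the game",
    "d3": "stocks fall as markets close",
}
vectors = {name: to_vector(text) for name, text in docs.items()}

sim_sports = cosine(vectors["d1"], vectors["d2"])   # share "team", "the", "game"
sim_mixed = cosine(vectors["d1"], vectors["d3"])    # share no words
print(round(sim_sports, 3), sim_mixed)
```

Documents that share many words get a similarity close to 1 and would tend to land in the same cluster; documents with no words in common get 0.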
25Issues
- Given desired number of clusters?
- Finding best clusters
- Are clusters semantically meaningful?
- e.g., yuppies cluster?
- Using clusters for disk storage
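The first two issues show up directly in k-means, used here only as one familiar clustering method (the slides do not name an algorithm). The number of clusters has to be given up front, and the result depends on the starting centers. The points and initial centers below are hypothetical.

```python
import math

def kmeans(points, centers, iterations=10):
    """Plain k-means (Lloyd's algorithm); k is fixed by the initial centers."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its old center).
        centers = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else old
            for g, old in zip(groups, centers)
        ]
    return centers, groups

# Hypothetical (income in $1000s, age) pairs.
points = [(20, 25), (22, 27), (21, 24), (80, 50), (85, 55), (82, 52)]
centers, groups = kmeans(points, centers=[(20, 25), (80, 50)])
print(centers)
```

The algorithm returns groups that are compact in the chosen distance, but it says nothing about whether a group corresponds to a meaningful concept; that judgment still belongs to the analyst.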
26Association Rule Mining
[Table: sales records (market-basket data) with columns
transaction id, customer id, products bought]
- Example Rules
- age(X, 20..29) ∧ income(X, 20..29K) ⇒
buys(X, PC) [support = 2%, confidence = 60%]
- buys(x, p2) ∧ buys(x, p5) ⇒ buys(x, p8) [1%, 85%]
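The two numbers attached to each rule are its support and confidence. The sketch below computes both for the second rule pattern (p2 and p5 imply p8) on a small, invented set of transactions; the resulting values are therefore unrelated to the percentages quoted on the slide.

```python
# Hypothetical market-basket data: transaction id -> set of products bought.
transactions = {
    1: {"p2", "p5", "p8"},
    2: {"p2", "p5"},
    3: {"p2", "p5", "p8"},
    4: {"p1", "p8"},
    5: {"p2", "p3"},
}

def support(itemset):
    """Fraction of all transactions that contain every item in itemset."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

rule_support = support({"p2", "p5", "p8"})            # 2 of 5 transactions
rule_confidence = confidence({"p2", "p5"}, {"p8"})    # 2 of the 3 with p2 and p5
print(rule_support, rule_confidence)
```

Support measures how often the whole rule applies in the data, while confidence measures how reliable the implication is once the left-hand side holds.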
27Characteristics and Assumptions of Popular Data
Mining/Analysis Techniques
- Distance-based approaches (assume that a distance
function with respect to the objects in the
dataset exists) vs. order-based approaches (just
use the ordering of values in their decision
making: 3>2>1 is indistinguishable from
2.01>2>1.99)
- Approaches that make no assumptions vs. approaches
that assume a particular distribution of the data in the
underlying dataset.
- Differences in employed approximation techniques
- Rectangular vs. other approximation
- Linear vs. non-linear approximations
- Sensitivity to redundant attributes (variables)
- Sensitivity to irrelevant attributes
- Sensitivity to attributes of different degrees of
importance
- Different Training Performance / Testing
Performance
- What does the learnt function tell us about the
analyzed data set? How difficult is it to
understand the learnt function?
- Deterministic / non-deterministic approaches
- Stability of the obtained results
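The distance-based vs. order-based distinction in the first bullet can be illustrated with the two sequences 3, 2, 1 and 2.01, 2, 1.99. The helper names `ranks` and `spread` are made up for this sketch.

```python
def ranks(values):
    """Replace each value by its rank (0 = smallest); assumes no ties."""
    order = sorted(values)
    return [order.index(v) for v in values]

def spread(values):
    """Largest pairwise distance, a simple distance-based summary."""
    return max(values) - min(values)

a = [3, 2, 1]
b = [2.01, 2, 1.99]

same_order = ranks(a) == ranks(b)           # an order-based method sees no difference
spread_a, spread_b = spread(a), spread(b)   # a distance-based method does
print(same_order, spread_a, spread_b)
```

An order-based technique only ever sees the rank vectors, which are identical for both sequences, whereas a distance-based technique is sensitive to how far apart the values are.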
28Summary: KDD
- KDD: discovering interesting patterns from large
amounts of data
- A natural evolution of database technology, in
great demand, with wide applications
- A KDD process includes data cleaning, data
integration, data selection, transformation, data
mining, pattern evaluation, and knowledge
presentation
- Mining can be performed in a variety of
information repositories
- Data mining functionalities: characterization,
discrimination, association, classification,
clustering, outlier and trend analysis, etc.
- Multi-disciplinary activity
- Important issues: KDD methodologies and
user interactions, scalability, tool use and tool
integration, preprocessing, interpretation of
results, finding good parameter settings when
running data mining tools, ...
29Where to Find References?
- Data mining and KDD (SIGKDD member CDROM)
- Conference proceedings: KDD, and others, such as
PKDD, PAKDD, etc.
- Journal: Data Mining and Knowledge Discovery
- Database field (SIGMOD member CD ROM)
- Conference proceedings: ACM-SIGMOD, ACM-PODS,
VLDB, ICDE, EDBT, DASFAA
- Journals: ACM-TODS, J. ACM, IEEE-TKDE, JIIS, etc.
- AI and Machine Learning
- Conference proceedings: Machine learning, AAAI,
IJCAI, etc.
- Journals: Machine Learning, Artificial
Intelligence, etc.
- Statistics
- Conference proceedings: Joint Stat. Meeting, etc.
- Journals: Annals of Statistics, etc.
- Visualization
- Conference proceedings: CHI, etc.
- Journals: IEEE Trans. Visualization and Computer
Graphics, etc.