Knowledge Discovery and Data Mining Lecture 1 - PowerPoint PPT Presentation

Provided by: hotu7
1
Knowledge Discovery and Data Mining (Lecture 1)
2
Objectives
  • Fundamental techniques of knowledge discovery and
    data mining (KDD)
  • Issues in the practical use of KDD and its tools
  • Common data mining tasks

3
Overview of KDD and Data Mining
1. What is KDD and Why?
2. The KDD Process
3. KDD Applications
4. Data Mining Methods
5. Challenges for KDD
4

KDD: A Definition
KDD is the automatic extraction of non-obvious,
hidden knowledge from large volumes of data.
10^6 to 10^12 bytes: we never see the whole data set
or put it in the memory of computers.
What knowledge? How to represent and use it?
Data mining algorithms?
5
Data, Information, Knowledge
We often see data as a string of bits, or
numbers and symbols, or objects which we
collect daily.
Information is data stripped of redundancy, and
reduced to the minimum necessary to characterize
the data.
Knowledge is integrated information, including
facts and their relations, which have been
perceived, discovered, or learned as our mental
pictures.
Knowledge can be considered
data at a high level of abstraction and
generalization.
6
From Data to Knowledge
Medical Data by Dr. Tsumoto, Tokyo Med. Dent. Univ., 38 attributes
...
10, M, 0, 10, 10, 0, 0, 0, SUBACUTE, 37, 2, 1, 0, 15, -, -, 6000, 2, 0, abnormal, abnormal, -, 2852, 2148, 712, 97, 49, F, -, multiple, , 2137, negative, n, n, ABSCESS, VIRUS
12, M, 0, 5, 5, 0, 0, 0, ACUTE, 38.5, 2, 1, 0, 15, -, -, 10700, 4, 0, normal, abnormal, , 1080, 680, 400, 71, 59, F, -, ABPCCZX, , 70, negative, n, n, n, BACTERIA, BACTERIA
15, M, 0, 3, 2, 3, 0, 0, ACUTE, 39.3, 3, 1, 0, 15, -, -, 6000, 0, 0, normal, abnormal, , 1124, 622, 502, 47, 63, F, -, FMOXAMK, , 48, negative, n, n, n, BACTE(E), BACTERIA
16, M, 0, 32, 32, 0, 0, 0, SUBACUTE, 38, 2, 0, 0, 15, -, , 12600, 4, 0, abnormal, abnormal, , 41, 39, 2, 44, 57, F, -, ABPCCZX, ?, ?, negative, ?, n, n, ABSCESS, VIRUS
...
Annotations on the data: numerical attribute, categorical attribute,
missing values, class labels
IF cell_poly < 220 AND Risk = n AND Loc_dat
AND Nausea > 15 THEN Prediction = VIRUS
(87.5%: confidence, predictive accuracy)
7
Data Rich, Knowledge Poor
How to acquire knowledge for knowledge-based
systems remains the main difficult and crucial
problem.
People have gathered and stored so much data
because they think some valuable assets are
implicitly coded within it.
[Figure: a knowledge-based system (knowledge base, inference engine), with "?" marking where the knowledge comes from]
Raw data is rarely of direct benefit.
Its true value depends on the ability to extract
information useful for decision support.
Tradition: via knowledge engineers
Manual data analysis: impractical
New trend: via automatic programs
8

Benefits of Knowledge Discovery
[Figure: axes Value and Volume; stages EDP, MIS, DSS; labels Generate, Disseminate, Rapid Response]
EDP: Electronic Data Processing
MIS: Management Information Systems
DSS: Decision Support Systems
9
Overview of KDD
1. What is KDD and Why?
2. The KDD Process
3. KDD Applications
4. Data Mining Methods
5. Challenges for KDD
10
The KDD process
"The non-trivial process of identifying valid,
novel, potentially useful, and ultimately
understandable patterns in data" (Fayyad,
Piatetsky-Shapiro, Smyth, 1996)
11
The Knowledge Discovery Process
Data mining: a step in the KDD process consisting
of methods that produce useful patterns or models
from the data, under some acceptable computational
efficiency limitations
[Figure: the KDD process, with steps numbered 1 to 5]
KDD is inherently interactive and iterative
12
The KDD Process
  • Data organized by function
  • Create/select target database
  • Data warehousing
  • Select sampling technique and sample data
  • Supply missing values
  • Eliminate noisy data
  • Normalize values
  • Transform values
  • Create derived attributes
  • Find important attributes and value ranges
  • Select DM task(s)
  • Select DM method(s)
  • Extract knowledge
  • Test knowledge
  • Refine knowledge
  • Query and report generation, aggregation and
    sequences, advanced methods
  • Transform to different representation
[Figure: these activities grouped under process steps 1 to 5]
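Three of the preprocessing activities named on this slide (supply missing values, normalize values, create derived attributes) can be illustrated with a small sketch. Everything below is an assumption made for illustration: the function names, the choice of mean imputation and min-max scaling, the sample readings, and the 38.0 threshold are not taken from the lecture.

```python
from statistics import mean

def supply_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def normalize(values):
    """Min-max scale values into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: map everything to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical temperature readings with one missing value.
temperature = [37.0, 38.5, None, 38.0]
filled = supply_missing(temperature)
scaled = normalize(filled)

# Derived attribute: a simple fever flag (the 38.0 threshold is illustrative only).
fever = [t >= 38.0 for t in filled]
```

Mean imputation and min-max scaling are only one possible choice for each step; other strategies (median imputation, z-score scaling) fit the same slots.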
13
Main Contributing Areas of KDD
  • Statistics: infer info from data (deduction and
    induction, mainly numeric data)
  • Databases: store, access, search, update data
    (deduction)
      • Data warehouses: integrated data
      • OLAP: On-Line Analytical Processing
  • Machine Learning: computer algorithms that improve
    automatically through experience (mainly
    induction, symbolic data)
[Figure: the three areas contributing to KDD]
14
Lecture 1 Overview of KDD
1. What is KDD and Why?
2. The KDD Process
3. KDD Applications
4. Data Mining Methods
5. Challenges for KDD
15
Potential Applications
  • Business information: marketing and sales data
    analysis, investment analysis, loan approval,
    fraud detection, etc.
  • Manufacturing information: controlling and
    scheduling, network management, experiment
    result analysis, etc.
  • Scientific information: sky survey cataloging,
    biosequence databases, geosciences (Quakefinder),
    etc.
  • Personal information
16
KDD Opportunity and Challenges
Competitive Pressure
Data Rich Knowledge Poor (the resource)
KDD
Data Mining Technology Mature
Enabling Technology (Interactive MIS, OLAP,
parallel computing, Web, etc.)
17
Lecture 1 Overview of KDD
1. What is KDD and Why?
2. The KDD Process
3. KDD Applications
4. Data Mining and its Common Methods
5. Challenges for KDD
18
Data mining
  • Mining or discovery of new information in terms
    of patterns or rules from vast amounts of data,
    based on the following techniques
      • Machine learning
      • Statistics
      • Neural networks
      • Genetic algorithms
  • Applications
      • Retail/Marketing: consumer behaviour based on
        buying patterns
      • Finance: creditworthiness of clients,
        performance analysis of financial investments
      • Health care/Medicine: effectiveness / side
        effects of treatments

19
2.1 Data Mining Strategies
20
Supervised learning
  • Learning to assign objects to classes given
    examples
  • Learner (classifier)

A typical supervised text learning scenario.
21
Unsupervised Learning
  • The target goal is not pre-defined, e.g.
    gathering items in a database into groups where
    items in the same group are similar (clustering).
  • Examples
      • Clustering
      • Association Rule Discovery

22
Common Tasks of Data Mining
  • Classification: finding the description of
    several predefined classes and classifying a data
    item into one of them.
  • Clustering: identifying a finite set of
    categories or clusters to describe the data.
  • Association Rule: finding correlations between
    items in a database.
  • Summarization: finding a compact description for
    a subset of data.
  • Deviation and change detection: discovering the
    most significant changes in the data.
23
Classification
What factors determine cancerous cells?
[Figure: cancerous cell data (examples) are fed to a classification algorithm (a mining algorithm such as rule induction, decision tree, or neural network), which produces general patterns]
24
Classification
  • Learning is supervised.
  • The dependent variable is categorical.
  • Well-defined classes.
  • Current rather than future behavior.

25
Classification
  • Example
      • Input features x: face length, distance
        between eyes
      • Output y: whether the person is a man or a
        woman
  • Two phases
      • Training
      • Testing

[Figure: classified examples (labeled data) go to the learner, which produces a model; the model turns new unclassified examples (unlabeled data) into classified examples]
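The two phases can be sketched with a very simple learner. The nearest-centroid method used here is an assumption chosen for brevity (the slide does not name a learner), and the measurements, the function names `train` and `predict`, and the units are all hypothetical.

```python
import math

def train(labeled):
    """Training phase: compute one mean point (centroid) per class."""
    sums = {}
    for x, y in labeled:
        total, count = sums.get(y, ([0.0] * len(x), 0))
        sums[y] = ([t + v for t, v in zip(total, x)], count + 1)
    return {y: [t / n for t in total] for y, (total, n) in sums.items()}

def predict(model, x):
    """Testing phase: assign x to the class whose centroid is closest."""
    return min(model, key=lambda y: math.dist(model[y], x))

# Hypothetical labeled data: (face length, distance between eyes) in cm.
labeled = [
    ((20.0, 6.5), "man"), ((21.0, 6.8), "man"),
    ((18.0, 6.0), "woman"), ((17.5, 5.8), "woman"),
]
model = train(labeled)                    # learner -> model
prediction = predict(model, (20.5, 6.6))  # model applied to an unlabeled example
```

The model here is just a dictionary of class centroids; any of the classification methods listed later in the lecture could take the place of `train` and `predict`.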
26
Classification
  • Mathematically, assume there is some function
    F(x) = y producing the data. Given many pairs
    (x, y), find F.

[Figure: scatter plot of labeled examples; axes: face length, distance between eyes]
27
Classification
  • Issues
      • Expressiveness: how flexible is the modeling
        method?
      • Scalability: how fast can it learn a model
        from N features and M examples?
      • Overfitting: fitting the labeled examples too
        exactly often ends up degrading generalization
        performance. It is usually caused by training
        for too long in pursuit of a perfect model. A
        common safeguard is to evaluate on a separate
        test data set or to use cross-validation.
      • N-fold cross-validation: divide the training
        data into n partitions, learn the model on n-1
        partitions, and test the learned model on the
        held-out partition. Repeat the process n
        times, holding out a different partition each
        time, and average the n results to estimate
        the accuracy of the model.
      • Generalization: performance of the learned
        model on unseen (test) examples (i.e. beyond
        the training examples)
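The cross-validation procedure described in the list above might look like the following sketch. The function names, the `seed` parameter, the round-robin way of forming partitions, and the majority-class learner are assumptions made for illustration; any learner with a train step and an accuracy measure could be plugged in.

```python
import random

def cross_validate(examples, n, train, accuracy, seed=0):
    """Average held-out accuracy over n folds."""
    data = list(examples)
    random.Random(seed).shuffle(data)       # shuffle once, then partition
    folds = [data[i::n] for i in range(n)]  # n roughly equal partitions
    scores = []
    for i in range(n):
        held_out = folds[i]
        training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train(training)             # learn on the other n-1 partitions
        scores.append(accuracy(model, held_out))
    return sum(scores) / n

# Hypothetical learner: always predict the most frequent training label.
def train_majority(training):
    labels = [y for _, y in training]
    return max(set(labels), key=labels.count)

def accuracy(model, examples):
    return sum(1 for _, y in examples if y == model) / len(examples)

examples = [((i,), "a") for i in range(8)] + [((i,), "b") for i in range(2)]
score = cross_validate(examples, n=5, train=train_majority, accuracy=accuracy)
```

Each example is held out exactly once, so the averaged score uses every example for testing without ever testing a model on data it was trained on.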

28
Classification
  • Methods
  • K-Nearest neighbors
  • Decision trees
  • Rule Induction
  • Associative Classification
  • Naïve Bayes and Bayesian belief networks
  • Artificial neural networks

29

Classification: Rule Induction
What factors determine whether a cell is cancerous?
If Color = light and Tails = 1 and Nuclei = 2
Then Healthy Cell (certainty 92%)
If Color = dark and Tails = 2 and Nuclei = 2
Then Cancerous Cell (certainty 87%)
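Once rules like the two on this slide have been induced, applying them is a simple matching step. The sketch below assumes that each condition is an equality test and that the certainties 92 and 87 are percentages; the data layout and the `classify` function are hypothetical, and the induction step that finds the rules is not shown.

```python
# Each rule: (conditions, predicted class, certainty as a fraction).
RULES = [
    ({"color": "light", "tails": 1, "nuclei": 2}, "healthy", 0.92),
    ({"color": "dark", "tails": 2, "nuclei": 2}, "cancerous", 0.87),
]

def classify(cell, rules=RULES):
    """Return (label, certainty) of the first rule whose conditions all hold, else None."""
    for conditions, label, certainty in rules:
        if all(cell.get(attr) == value for attr, value in conditions.items()):
            return label, certainty
    return None  # no rule covers this cell

result = classify({"color": "dark", "tails": 2, "nuclei": 2})
```

A cell that matches neither rule gets `None`, which makes visible that a small rule set may not cover every case.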
30
Classification: Decision Trees
[Figure: decision tree with tests on Color (dark / light), nuclei (1 / 2), and tails (1 / 2); leaves labeled healthy or cancerous]
31
Classification: Neural Networks
What factors determine whether a cell is cancerous?
[Figure: neural network with inputs Color = dark, nuclei = 1, tails = 2 and outputs Healthy, Cancerous]
32
Association Rules
  • Which feature values are commonly associated with
    each other?
  • Knowledge of the form
      • IF (feature1 = value1) THEN (feature2 = value2)
  • Sample applications
      • Market basket analysis
      • Recommender systems
      • Microarray analysis
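For market basket analysis, the strength of a rule of the form above is commonly summarized by its support and confidence. These two measures are standard in association rule mining but are not defined on this slide, so the definitions in the sketch below are stated as assumptions, and the baskets are invented.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that also contain the consequent."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Hypothetical shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

# Rule: IF bread THEN milk
rule_support = support(baskets, {"bread", "milk"})
rule_confidence = confidence(baskets, {"bread"}, {"milk"})
```

Note that `confidence` divides by the antecedent's support, so it is only meaningful for antecedents that occur in at least one transaction.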

33
Association Rule Mining vs. Classification
34
Clustering
  • A form of unsupervised learning
  • Grouping data based on similarity
  • Need a similarity or distance function
  • Need a domain expert to interpret results

35
Clustering
  • Methods
  • Agglomerative clustering methods
  • K-means clustering
  • SOM (self-organizing map)
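As an illustration of one of these methods, here is a minimal K-means sketch. It assumes Euclidean distance as the distance function, caller-supplied initial centers, and a fixed number of iterations rather than a convergence test; the function name and sample points are hypothetical.

```python
import math

def k_means(points, centers, iterations=10):
    """Alternate between assigning points to the nearest center
    and moving each center to the mean of its cluster."""
    centers = [tuple(c) for c in centers]
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers, clusters

points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centers, clusters = k_means(points, centers=[(0.0, 0.0), (10.0, 10.0)])
```

The result depends on the initial centers, which is one reason a domain expert is still needed to judge whether the resulting groups are meaningful.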