Title: Introduction to Data Mining Chapter 4
1 Introduction to Data Mining: Chapter 4
2 Chapter 4 Outline
- Background
- Information is Power
- Knowledge is Power
- Data Mining
3 Introduction
4 Information is Power
- Relevant
- Right Information
- Globalised world
- Vast amount of information available
5 What is Information?
- A collection of data
- The act of human analysis and interpretation of activities
- Decomposing it into various components and tackling them
6 What is Knowledge?
- The act of human synthesis and evaluation of information
- Integration of the relevant components to form a relevant whole system
7 Why Mine Data? Commercial Viewpoint
- Lots of data is being collected and warehoused
- Web data, e-commerce
- purchases at department/grocery stores
- Bank/Credit Card transactions
- Computers have become cheaper and more powerful
- Competitive Pressure is Strong
- Provide better, customized services for an edge
(e.g. in Customer Relationship Management)
8 Why Mine Data? Scientific Viewpoint
- Data collected and stored at enormous speeds (GB/hour)
- remote sensors on a satellite
- telescopes scanning the skies
- microarrays generating gene expression data
- scientific simulations generating terabytes of data
- Traditional techniques infeasible for raw data
- Data mining may help scientists
- in classifying and segmenting data
- in Hypothesis Formation
9 Data Mining Definition I
- The nontrivial extraction of hidden, previously unidentified, and potentially valuable knowledge from data
- A variety of techniques, such as neural networks, decision trees, or standard statistical techniques, used to identify nuggets of information or decision-making knowledge in bodies of data and to extract these in such a way that they can be put to use in areas such as decision support, prediction, forecasting, and estimation
10 Data Mining Definition II
- Finding hidden information in a database
11 Hidden Information
- Number of years of experience
- Great secret recipes
- Success factors
12 Origins of Data Mining
- Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
- Traditional techniques may be unsuitable due to
- Enormity of data
- High dimensionality of data
- Heterogeneous, distributed nature of data
[Figure: Data Mining drawing on Statistics/AI, Machine Learning/Pattern Recognition, and Database systems]
13 What is (not) Data Mining?
- What is Data Mining?
- Certain names are more prevalent in certain US locations (O'Brien, O'Rourke, O'Reilly in the Boston area)
- Group together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com)
- What is not Data Mining?
- Look up a phone number in a phone directory
- Query a Web search engine for information about "Amazon"
14 Database Processing vs. Data Mining Processing
- Database processing
- Output: precise; a subset of the database
- Data mining processing
- Query: poorly defined; no precise query language
- Data: not operational data
- Output: fuzzy; not a subset of the database
15 Query Examples
- Database
- Find all credit applicants with the surname Lee.
- Identify customers who have purchased more than 100,000 in the last year.
- Find all customers who have purchased bread.
- Data Mining
- Find all credit applicants who are good credit risks. (classification)
- Identify customers with similar eating habits. (clustering)
- Find all items which are frequently purchased with bread. (association rules)
16 Data Mining Models and Tasks
17 Classification Definition
- Given a collection of records (training set)
- Each record contains a set of attributes; one of the attributes is the class.
- Find a model for the class attribute as a function of the values of the other attributes.
- Goal: previously unseen records should be assigned a class as accurately as possible.
- A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
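The train/test procedure described above can be sketched in a few lines. This is a minimal illustration only: the records, the 6/3 split, and the function names are hypothetical, and a 1-nearest-neighbour rule (a simple memory-based method) stands in for whatever model would actually be built.

```python
import math

# Hypothetical records: (attribute vector, class label). Not from the slides.
records = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B"),
    ((1.1, 0.9), "A"), ((5.1, 5.2), "B"), ((0.8, 1.2), "A"),
]

def split(data, train_size):
    """Divide the data into a training set and a test set (simple holdout)."""
    return data[:train_size], data[train_size:]

def predict(train, x):
    """1-nearest-neighbour: return the class of the closest training record."""
    nearest = min(train, key=lambda rec: math.dist(rec[0], x))
    return nearest[1]

def accuracy(train, test):
    """Fraction of test records whose predicted class matches the true class."""
    correct = sum(1 for x, label in test if predict(train, x) == label)
    return correct / len(test)

# Build the "model" from the training set, validate it on the test set.
train_set, test_set = split(records, train_size=6)
print(f"train={len(train_set)} test={len(test_set)} "
      f"accuracy={accuracy(train_set, test_set):.2f}")
```

The test records are never used to build the model, so the reported accuracy is an estimate of how the model behaves on previously unseen records.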
18 Illustrating Classification Task
19 Examples of Classification Task
- Predicting tumor cells as benign or malignant
- Classifying credit card transactions as legitimate or fraudulent
- Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
- Categorizing news stories as finance, weather, entertainment, sports, etc.
20 Classification Techniques
- Decision Tree based Methods
- Rule-based Methods
- Memory based reasoning
- Neural Networks
- Naïve Bayes and Bayesian Belief Networks
- Support Vector Machines
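21 Example of a Decision Tree
Model: Decision Tree (built from the Training Data; the Splitting Attributes are Refund, MarSt, and TaxInc)
- Refund = Yes: NO
- Refund = No: test MarSt
- MarSt = Married: NO
- MarSt = Single, Divorced: test TaxInc
- TaxInc < 80K: NO
- TaxInc > 80K: YES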
22 Another Example of Decision Tree
Training Data column types: categorical, categorical, continuous, class
- MarSt = Married: NO
- MarSt = Single, Divorced: test Refund
- Refund = Yes: NO
- Refund = No: test TaxInc
- TaxInc < 80K: NO
- TaxInc > 80K: YES
There could be more than one tree that fits the same data!
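One way to see that more than one tree can fit the same data is to write the trees from slides 21 and 22 as functions and compare their predictions. A minimal sketch, assuming records are given as (refund, marital status, taxable income in thousands) and that an income of exactly 80K follows the "> 80K" branch (the slides do not say which); the function names and the sample values are hypothetical.

```python
from itertools import product

def tree_refund_first(refund, marst, taxinc_k):
    """Tree from slide 21: Refund, then MarSt, then TaxInc."""
    if refund == "Yes":
        return "NO"
    if marst == "Married":
        return "NO"
    # Assumption: exactly 80K is treated like "> 80K".
    return "NO" if taxinc_k < 80 else "YES"

def tree_marst_first(refund, marst, taxinc_k):
    """Tree from slide 22: MarSt, then Refund, then TaxInc."""
    if marst == "Married":
        return "NO"
    if refund == "Yes":
        return "NO"
    # Same assumption about exactly 80K as above.
    return "NO" if taxinc_k < 80 else "YES"

# Compare the two trees on a grid of hypothetical records.
grid = list(product(["Yes", "No"], ["Single", "Married", "Divorced"], [60, 80, 125]))
disagreements = [r for r in grid if tree_refund_first(*r) != tree_marst_first(*r)]
print(f"{len(grid)} records checked, {len(disagreements)} disagreements")
```

Both trees predict YES only when Refund is No, the person is not Married, and the income is on the high branch, so they differ in shape but not in the class they assign.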
23 Decision Tree Classification Task
Decision Tree
24-29 Apply Model to Test Data
Test Data
Start from the root of the tree (the tree from slide 21) and follow the matching branch at each node until a leaf is reached.
- Refund = Yes: NO
- Refund = No: test MarSt
- MarSt = Married: NO
- MarSt = Single, Divorced: test TaxInc
- TaxInc < 80K: NO
- TaxInc > 80K: YES
The traversal ends at a NO leaf: assign Cheat to "No".
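Applying the model amounts to walking from the root to a leaf. The sketch below stores the tree as nested dictionaries and records the path taken. The test record is hypothetical (the slide's test data table did not survive extraction), as are the names `TREE` and `classify`; an income of exactly 80K is assumed to follow the "> 80K" branch.

```python
# Internal nodes are dicts with a name, a test function that returns a branch
# label, and the branches themselves. Leaves are strings: the predicted Cheat value.
TREE = {
    "name": "Refund",
    "test": lambda r: r["Refund"],
    "branches": {
        "Yes": "NO",
        "No": {
            "name": "MarSt",
            "test": lambda r: "Married" if r["MarSt"] == "Married" else "Single, Divorced",
            "branches": {
                "Married": "NO",
                "Single, Divorced": {
                    "name": "TaxInc",
                    # Assumption: exactly 80K goes to the "> 80K" branch.
                    "test": lambda r: "< 80K" if r["TaxInc"] < 80 else "> 80K",
                    "branches": {"< 80K": "NO", "> 80K": "YES"},
                },
            },
        },
    },
}

def classify(record, node=TREE):
    """Start at the root and follow matching branches until a leaf is reached."""
    path = []
    while isinstance(node, dict):
        branch = node["test"](record)
        path.append(f"{node['name']} = {branch}")
        node = node["branches"][branch]
    return node, path

# Hypothetical test record (income in thousands).
label, path = classify({"Refund": "No", "MarSt": "Married", "TaxInc": 80})
print(" -> ".join(path), "=> Cheat =", label)
```

Keeping the tree as data rather than as nested `if` statements means the same `classify` loop works for any tree of this shape.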
30 Decision Tree Classification Task
Decision Tree
31 What is Cluster Analysis?
- Finding groups of objects such that the objects
in a group will be similar (or related) to one
another and different from (or unrelated to) the
objects in other groups
32 Applications of Cluster Analysis
- Understanding
- Group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations
- Summarization
- Reduce the size of large data sets
33 What is not Cluster Analysis?
- Supervised classification
- Have class label information
- Simple segmentation
- Dividing students into different registration groups alphabetically, by last name
- Results of a query
- Groupings are a result of an external specification
- Graph partitioning
- Some mutual relevance and synergy, but areas are not identical
34 Notion of a Cluster can be Ambiguous
35 Types of Clusterings
- A clustering is a set of clusters
- Important distinction between hierarchical and partitional sets of clusters
- Partitional Clustering
- A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
- Hierarchical clustering
- A set of nested clusters organized as a hierarchical tree
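As an illustration of a partitional clustering, the sketch below uses a basic k-means loop, one common partitional method, so that each point ends up in exactly one cluster. The 2-D points, the choice of two clusters, the initial centroids, and the fixed iteration count are all assumptions made for the example.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Basic k-means: assign each point to its nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return clusters

# Hypothetical data: two visibly separated groups of points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
for i, cluster in enumerate(clusters):
    print(f"cluster {i}: {cluster}")
```

The result is a set of non-overlapping subsets; a hierarchical method would instead produce nested clusters that can be cut at different levels.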
36 Partitional Clustering
Original Points
37 Hierarchical Clustering
Traditional Hierarchical Clustering
Traditional Dendrogram
Non-traditional Hierarchical Clustering
Non-traditional Dendrogram
38 Association Rules
- Association Rules are a data mining technique and complement market basket analysis.
- All association rules are unidirectional and take the following form:
- Left-hand side rule IMPLIES Right-hand side rule
- Both the left-hand side and the right-hand side of the rule may contain multiple items or combinations of items, such as the following:
- Yellow Peppers IMPLIES Red Peppers, Bananas, and Bakery
- Associations are written as A → B, where A is called the antecedent or left-hand side (LHS) and B is called the consequent or right-hand side (RHS).
- Ex: If people buy a printer, then they buy a cartridge
- The antecedent is "buy printer" and the consequent is "buy cartridge"
39 Association Rules
- Market Basket Analysis
- Necessary to have a list of transactions and what was purchased in each one.
- Ex:
- Transaction 1: Frozen Pizza, Cola, Milk
- Transaction 2: Milk, Potato Chips
- Transaction 3: Cola, Frozen Pizza
- Transaction 4: Milk, Pretzels
- Transaction 5: Cola, Pretzels
40 Association Rules
41 Association Rules
- Measures of Association
- Support: the percentage of baskets in the analysis where the rule is true, that is, where both the left-hand side and the right-hand side of the association are found.
- Confidence: the percentage of baskets containing the left-hand side item that also contain the right-hand side item. Confidence differs from support in that it is the probability that the right-hand side item is present given that the left-hand side item is in the basket.
- Calculated as a ratio:
- (frequency of A and B) / (frequency of A)
42 Association Rules
- Measures of Association
- The support measure
- for the rule
- Cola IMPLIES Frozen Pizza is 40%
- Frozen Pizza IMPLIES Cola is 40%
- for a single item
- Milk is 60%
- (Note: support considers only the combination and not the direction.)
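These support figures can be reproduced from the five transactions on the Market Basket Analysis slide. A minimal sketch; the function name `support` and the lower-cased item spellings are choices made for this example.

```python
# The five transactions from the Market Basket Analysis slide, as sets of items.
transactions = [
    {"frozen pizza", "cola", "milk"},
    {"milk", "potato chips"},
    {"cola", "frozen pizza"},
    {"milk", "pretzels"},
    {"cola", "pretzels"},
]

def support(items, baskets):
    """Percentage of baskets that contain every item in `items`."""
    hits = sum(1 for basket in baskets if set(items) <= basket)
    return 100 * hits / len(baskets)

# Support ignores direction: both rules use the same item combination.
print(support({"cola", "frozen pizza"}, transactions))  # Cola IMPLIES Frozen Pizza
print(support({"frozen pizza", "cola"}, transactions))  # Frozen Pizza IMPLIES Cola
print(support({"milk"}, transactions))                  # single item
```

Because the rule's two sides are merged into one item set before counting, swapping them cannot change the result.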
43 Association Rules
- Measures of Association
- Confidence
- Milk IMPLIES Potato Chips has confidence
- (frequency of A and B) / (frequency of A)
- = 20% / 60%
- = 33%
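The same ratio can be computed directly from the five transactions on the Market Basket Analysis slide. A sketch only; `count` and `confidence` are hypothetical helper names, and the reversed rule is added to show that confidence, unlike support, depends on direction.

```python
# The five transactions from the Market Basket Analysis slide, as sets of items.
transactions = [
    {"frozen pizza", "cola", "milk"},
    {"milk", "potato chips"},
    {"cola", "frozen pizza"},
    {"milk", "pretzels"},
    {"cola", "pretzels"},
]

def count(items, baskets):
    """Number of baskets that contain every item in `items`."""
    return sum(1 for basket in baskets if set(items) <= basket)

def confidence(lhs, rhs, baskets):
    """(frequency of A and B) / (frequency of A), as a percentage."""
    return 100 * count(set(lhs) | set(rhs), baskets) / count(lhs, baskets)

# Milk IMPLIES Potato Chips: 1 basket with both, 3 baskets with milk.
print(round(confidence({"milk"}, {"potato chips"}, transactions)))
# Potato Chips IMPLIES Milk: the reversed rule has a different confidence.
print(round(confidence({"potato chips"}, {"milk"}, transactions)))
```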
44 Data Mining vs. KDD
- Knowledge Discovery in Databases (KDD): process of finding useful information and patterns in data.
- Data Mining: use of algorithms to extract the information and patterns derived by the KDD process.
45 KDD Process
Modified from [FPSS96C]
- Selection (Pre-Mining 1): Obtain data from various sources.
- Preprocessing (Pre-Mining 2): Cleanse data.
- Transformation (Pre-Mining 3): Convert to common format. Transform to new format.
- Data Mining: Obtain desired results.
- Interpretation/Evaluation (Post-Mining): Present results to user in meaningful manner.
46 KDD Process Ex: Web Log
- Selection
- Select log data (dates and locations) to use
- Preprocessing
- Remove identifying URLs
- Remove error logs
- Transformation
- Sessionize (sort and group)
- Data Mining
- Identify and count patterns
- Construct data structure
- Interpretation/Evaluation
- Identify and display frequently accessed sequences.
- Potential User Applications
- Cache prediction
- Personalisation
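The web-log steps above can be sketched end to end. Everything concrete here is an assumption for illustration: the log record layout (user, minute, URL, status), the 30-minute session gap, and the choice to count consecutive page pairs as the "patterns". Removal of identifying URLs is omitted for brevity.

```python
from collections import Counter

# Hypothetical log records: (user, minute, url, http_status).
log = [
    ("u1", 0, "/home", 200), ("u1", 1, "/products", 200), ("u1", 2, "/cart", 200),
    ("u2", 0, "/home", 200), ("u2", 3, "/products", 200), ("u2", 4, "/missing", 404),
    ("u1", 90, "/home", 200), ("u1", 91, "/products", 200),
]

SESSION_GAP = 30  # assumed: a gap of more than 30 minutes starts a new session

# Selection and preprocessing: keep successful requests only (drop error entries).
clean = [rec for rec in log if rec[3] == 200]

# Transformation: sessionize (sort by user and time, then group into sessions).
sessions = []
last_seen = {}  # user -> (minute of last request, index of current session)
for user, minute, url, _ in sorted(clean):
    if user in last_seen and minute - last_seen[user][0] <= SESSION_GAP:
        index = last_seen[user][1]
    else:
        sessions.append([])
        index = len(sessions) - 1
    sessions[index].append(url)
    last_seen[user] = (minute, index)

# Data mining: identify and count patterns (here, consecutive page pairs).
pairs = Counter(pair for s in sessions for pair in zip(s, s[1:]))

# Interpretation/evaluation: display the most frequently accessed sequences.
for pair, n in pairs.most_common(2):
    print(" -> ".join(pair), n)
```

A frequent pair such as "/home" followed by "/products" is the kind of pattern that cache prediction or personalisation could then act on.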
47 Data Mining Development
- Similarity Measures
- Hierarchical Clustering
- IR Systems
- Imprecise Queries
- Textual Data
- Web Search Engines
- Relational Data Model
- SQL
- Association Rule Algorithms
- Data Warehousing
- Scalability Techniques
- Bayes Theorem
- Regression Analysis
- EM Algorithm
- K-Means Clustering
- Time Series Analysis
- Algorithm Design Techniques
- Algorithm Analysis
- Data Structures
- Neural Networks
- Decision Tree Algorithms
48 Data Mining: What It Can't Do
- Tell the value of the patterns to the organization
- Replace skilled business analysts or managers
- Automatically discover solutions without guidance