Introduction to Data Mining Chapter 4 - PowerPoint PPT Presentation

Slides: 49
Provided by: SEAS80
1
Introduction to Data Mining Chapter 4

2
Chapter 4 Outline
  • Background
  • Information is Power
  • Knowledge is Power
  • Data Mining

3
Introduction

4
Information is Power
  • Relevant
  • Right Information
  • Globalised world
  • Vast amount of information available

5
What is Information?
  • A collection of data
  • The act of human analysis and interpretation of
    activities
  • Decomposing a problem into its various components
    and tackling them

6
What is Knowledge?
  • The act of human synthesis and evaluation of
    information
  • Integration of the relevant components to form
    a coherent whole system.

7
Why Mine Data? Commercial Viewpoint
  • Lots of data is being collected and warehoused
  • Web data, e-commerce
  • purchases at department/grocery stores
  • Bank/Credit Card transactions
  • Computers have become cheaper and more powerful
  • Competitive Pressure is Strong
  • Provide better, customized services for an edge
    (e.g. in Customer Relationship Management)

8
Why Mine Data? Scientific Viewpoint
  • Data collected and stored at enormous speeds
    (GB/hour)
  • remote sensors on a satellite
  • telescopes scanning the skies
  • microarrays generating gene expression data
  • scientific simulations generating terabytes of
    data
  • Traditional techniques infeasible for raw data
  • Data mining may help scientists
  • in classifying and segmenting data
  • in Hypothesis Formation

9
Data Mining Definition I
  • The nontrivial extraction of hidden, previously
    unidentified, and potentially valuable knowledge
    from data
  • A variety of techniques such as neural networks,
    decision trees or standard statistical techniques
    to identify nuggets of information or
    decision-making knowledge in bodies of data, and
    extracting these in such a way that they can be
    put to use in areas such as decision support,
    prediction, forecasting, and estimation.

10
Data Mining Definition II
  • Finding hidden information in a database

11
Hidden Information
  • Years of experience
  • Great secret recipes
  • Success Factors

12
Origins of Data Mining
  • Draws ideas from machine learning/AI, pattern
    recognition, statistics, and database systems
  • Traditional techniques may be unsuitable due to
  • Enormity of data
  • High dimensionality of data
  • Heterogeneous, distributed nature of data

[Figure: Data Mining shown together with Statistics/AI, Machine Learning/Pattern Recognition, and Database systems]
13
What is (not) Data Mining?
  • What is Data Mining?
  • Certain names are more prevalent in certain US
    locations (O'Brien, O'Rourke, O'Reilly in the
    Boston area)
  • Group together similar documents returned by a
    search engine according to their context (e.g.
    Amazon rainforest, Amazon.com)
  • What is not Data Mining?
  • Look up phone number in phone directory
  • Query a Web search engine for information about
    Amazon

14
Database Processing vs. Data Mining Processing
  • Database processing
  • Query: well defined, expressed in SQL
  • Data: operational data
  • Output: precise, a subset of the database
  • Data mining processing
  • Query: poorly defined, no precise query language
  • Data: not operational data
  • Output: fuzzy, not a subset of the database

15
Query Examples
  • Database
  • Find all credit applicants with the surname Lee.
  • Identify customers who have purchased more than
    100,000 in the last year.
  • Find all customers who have purchased bread.
  • Data Mining
  • Find all credit applicants who are good credit
    risks. (classification)
  • Identify customers with similar eating habits.
    (clustering)
  • Find all items which are frequently purchased
    with bread. (association rules)
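
The contrast between the two kinds of query can be sketched in a few lines of Python. This is only an illustration: the `purchases` table, the customer names, and the baskets are all made up, and the co-occurrence count is a deliberately crude stand-in for a real association rule algorithm.

```python
import sqlite3
from collections import Counter

# Hypothetical toy data (not from the slides): one row per purchased item.
rows = [
    ("Lee", "bread"), ("Lee", "butter"),
    ("Kim", "bread"), ("Kim", "butter"), ("Kim", "jam"),
    ("Ng", "milk"),
]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer TEXT, item TEXT)")
conn.executemany("INSERT INTO purchases VALUES (?, ?)", rows)

# Database query: well defined, and the answer is an exact subset of the data.
bread_buyers = [r[0] for r in conn.execute(
    "SELECT DISTINCT customer FROM purchases "
    "WHERE item = 'bread' ORDER BY customer")]

# Data-mining-style question: which items tend to be bought with bread?
# The answer is a summary (counts), not a subset of the stored rows.
baskets = {}
for customer, item in rows:
    baskets.setdefault(customer, set()).add(item)
with_bread = Counter(
    item
    for items in baskets.values() if "bread" in items
    for item in items if item != "bread")

print(bread_buyers)               # customers who bought bread
print(with_bread.most_common())   # items co-purchased with bread, with counts
```

The SQL query returns stored values unchanged, while the second result is derived information that appears nowhere in the table itself.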

16
Data Mining Models and Tasks
17
Classification Definition
  • Given a collection of records (training set)
  • Each record contains a set of attributes, one of
    the attributes is the class.
  • Find a model for class attribute as a function
    of the values of other attributes.
  • Goal: previously unseen records should be
    assigned a class as accurately as possible.
  • A test set is used to determine the accuracy of
    the model. Usually, the given data set is divided
    into training and test sets, with training set
    used to build the model and test set used to
    validate it.
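
The train/test procedure described above might look like the following minimal sketch. The records, the 30% test fraction, the seed, and the "majority class" model are all assumptions made for illustration; a real classifier would replace `fit_majority`.

```python
import random
from collections import Counter

# Hypothetical labelled records: (attributes, class label).
records = [({"income": inc}, "high" if inc >= 80 else "low")
           for inc in (20, 35, 50, 60, 75, 85, 90, 100, 120, 150)]

def split(data, test_fraction=0.3, seed=0):
    """Shuffle a copy of the data and cut it into training and test sets."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_majority(train):
    """Stand-in model: always predict the most common training class."""
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda attributes: majority

def accuracy(model, test):
    """Fraction of test records whose predicted class equals the true class."""
    return sum(model(attrs) == label for attrs, label in test) / len(test)

train, test = split(records)
model = fit_majority(train)      # built from the training set only
score = accuracy(model, test)    # validated on records the model never saw
print(f"train={len(train)} test={len(test)} accuracy={score:.2f}")
```

The point of the split is that `score` is measured only on records that played no part in building the model.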

18
Illustrating Classification Task
19
Examples of Classification Task
  • Predicting tumor cells as benign or malignant
  • Classifying credit card transactions as
    legitimate or fraudulent
  • Classifying secondary structures of protein as
    alpha-helix, beta-sheet, or random coil
  • Categorizing news stories as finance, weather,
    entertainment, sports, etc.

20
Classification Techniques
  • Decision Tree based Methods
  • Rule-based Methods
  • Memory based reasoning
  • Neural Networks
  • Naïve Bayes and Bayesian Belief Networks
  • Support Vector Machines

21
Example of a Decision Tree
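[Figure: training data table and the decision tree model built from it]
  • Splitting attributes: Refund, MarSt, TaxInc
  • Refund = Yes: NO
  • Refund = No: test MarSt
  • MarSt = Married: NO
  • MarSt = Single, Divorced: test TaxInc
  • TaxInc < 80K: NO
  • TaxInc > 80K: YES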
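
Read as code, the tree on this slide (Refund at the root, then MarSt, then TaxInc) is a chain of nested tests. The sketch below is one way to write it down; the function name and argument names are invented here, and sending an income of exactly 80K down the "> 80K" branch is an assumption, since the slide labels only "< 80K" and "> 80K".

```python
def classify(refund, marital_status, taxable_income_k):
    """Return the leaf label ("YES" or "NO") for one record.

    taxable_income_k is the taxable income in thousands.
    """
    if refund == "Yes":
        return "NO"
    if marital_status == "Married":
        return "NO"
    # Single or Divorced: the decision depends on taxable income.
    # Assumption: exactly 80K is grouped with the "> 80K" branch.
    return "NO" if taxable_income_k < 80 else "YES"

# A few hypothetical records pushed through the tree.
print(classify("Yes", "Single", 125))
print(classify("No", "Married", 100))
print(classify("No", "Divorced", 95))
```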
22
Another Example of Decision Tree
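[Figure: training data with two categorical attributes, one continuous attribute, and the class column, next to a second decision tree]
  • MarSt = Married: NO
  • MarSt = Single, Divorced: test Refund
  • Refund = Yes: NO
  • Refund = No: test TaxInc
  • TaxInc < 80K: NO
  • TaxInc > 80K: YES
  • There could be more than one tree that fits the
    same data!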
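
One way to see that two differently shaped trees can fit the same data is to encode both and compare their answers. The sketch below assumes the first tree tests Refund, then MarSt, then TaxInc, and the second tests MarSt, then Refund, then TaxInc; the function names, the probe values, and the treatment of exactly 80K are assumptions.

```python
from itertools import product

def tree_refund_first(refund, marital, income_k):
    """Tree with Refund at the root."""
    if refund == "Yes":
        return "NO"
    if marital == "Married":
        return "NO"
    return "NO" if income_k < 80 else "YES"

def tree_marst_first(refund, marital, income_k):
    """Tree with MarSt at the root."""
    if marital == "Married":
        return "NO"
    if refund == "Yes":
        return "NO"
    return "NO" if income_k < 80 else "YES"

# Probe every combination of a few hypothetical attribute values.
grid = list(product(("Yes", "No"),
                    ("Single", "Married", "Divorced"),
                    (60, 80, 95)))
disagreements = [r for r in grid
                 if tree_refund_first(*r) != tree_marst_first(*r)]
print(len(grid), "records checked,", len(disagreements), "disagreements")
```

Both trees answer YES only when Refund is No, the person is not married, and income is not below 80K, so they agree on every record even though their shapes differ.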
23
Decision Tree Classification Task
Decision Tree
24
Apply Model to Test Data
Test Data
Start from the root of the tree.
25
Apply Model to Test Data
Test Data
26
Apply Model to Test Data
[Figure: test data record next to the decision tree (Refund, MarSt, TaxInc), traversal in progress]
27
Apply Model to Test Data
[Figure: test data record next to the decision tree (Refund, MarSt, TaxInc), traversal in progress]
28
Apply Model to Test Data
[Figure: test data record next to the decision tree (Refund, MarSt, TaxInc), traversal in progress]
29
Apply Model to Test Data
[Figure: test data record next to the decision tree (Refund, MarSt, TaxInc), traversal reaches a leaf]
  • Assign Cheat to "No"
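
The walk from the root to a leaf can be sketched with the tree stored as data rather than as code. The test record's values did not survive in this transcript, so the record below (Refund = No, MarSt = Married, TaxInc = 80K) is a hypothetical one chosen so that the walk ends with Cheat assigned to "No"; the tuple layout and helper names are also assumptions.

```python
# Internal nodes are (attribute, {branch label: subtree}); leaves are strings.
TREE = ("Refund", {
    "Yes": "NO",
    "No": ("MarSt", {
        "Married": "NO",
        "Single, Divorced": ("TaxInc", {"< 80K": "NO", "> 80K": "YES"}),
    }),
})

def branch_for(attribute, value):
    """Map a record's attribute value to the branch label used in TREE."""
    if attribute == "MarSt":
        return "Married" if value == "Married" else "Single, Divorced"
    if attribute == "TaxInc":
        # Assumption: exactly 80K follows the "> 80K" branch.
        return "< 80K" if value < 80 else "> 80K"
    return value

def apply_model(tree, record):
    """Start at the root and follow one branch per test until a leaf."""
    path, node = [], tree
    while not isinstance(node, str):
        attribute, branches = node
        branch = branch_for(attribute, record[attribute])
        path.append(f"{attribute} = {branch}")
        node = branches[branch]
    return node, path

record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}  # hypothetical
label, path = apply_model(TREE, record)
print(" -> ".join(path), "=> Cheat:", label)
```

For this record the TaxInc test is never reached, because the walk stops at the first leaf it encounters.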
30
Decision Tree Classification Task
Decision Tree
31
What is Cluster Analysis?
  • Finding groups of objects such that the objects
    in a group will be similar (or related) to one
    another and different from (or unrelated to) the
    objects in other groups

32
Applications of Cluster Analysis
  • Understanding
  • Group related documents for browsing, group genes
    and proteins that have similar functionality, or
    group stocks with similar price fluctuations
  • Summarization
  • Reduce the size of large data sets

33
What is not Cluster Analysis?
  • Supervised classification
  • Have class label information
  • Simple segmentation
  • Dividing students into different registration
    groups alphabetically, by last name
  • Results of a query
  • Groupings are a result of an external
    specification
  • Graph partitioning
  • Some mutual relevance and synergy, but areas are
    not identical

34
Notion of a Cluster can be Ambiguous
35
Types of Clusterings
  • A clustering is a set of clusters
  • Important distinction between hierarchical and
    partitional sets of clusters
  • Partitional Clustering
  • A division of data objects into non-overlapping
    subsets (clusters) such that each data object is
    in exactly one subset
  • Hierarchical clustering
  • A set of nested clusters organized as a
    hierarchical tree
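
The two notions can be related in a small sketch: an agglomerative procedure produces a nested sequence of clusterings (a hierarchy), and any single level of that sequence is a partitional clustering. Single-link distance on one-dimensional points is used here purely as an illustrative choice; the slides do not prescribe an algorithm, and the sample points are made up.

```python
def single_link_levels(points):
    """Merge the two closest clusters repeatedly; return every level."""
    clusters = [[p] for p in sorted(points)]
    levels = [[c[:] for c in clusters]]
    while len(clusters) > 1:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        clusters.sort()
        levels.append([c[:] for c in clusters])
    return levels

points = [1, 2, 6, 7, 15]  # hypothetical one-dimensional data
levels = single_link_levels(points)
for level in levels:
    print(level)

# Cutting the hierarchy at three clusters gives one partitional clustering.
partition = next(level for level in levels if len(level) == 3)
```

Every printed level places each point in exactly one cluster, and each level is formed by merging two clusters of the level before it, which is what makes the set of clusters nested.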

36
Partitional Clustering
Original Points
37
Hierarchical Clustering
Traditional Hierarchical Clustering
Traditional Dendrogram
Non-traditional Hierarchical Clustering
Non-traditional Dendrogram
38
Association Rules
  • Association rules are a data mining technique and
    complement market basket analysis.
  • All association rules are unidirectional and take
    the following form:
  • Left-hand side rule IMPLIES Right-hand side rule
  • Both the left-hand side and the right-hand side
    of the rule may contain multiple items or
    combinations of items, such as the following:
  • Yellow Peppers IMPLIES Red Peppers, Bananas, and
    Bakery
  • Associations are written as A → B, where A is
    called the antecedent or left-hand side (LHS) and
    B is called the consequent or right-hand side
    (RHS).
  • Example: "If people buy a printer, then they buy
    a cartridge."
  • The antecedent is "buy a printer" and the
    consequent is "buy a cartridge".

39
Association Rules
  • Market Basket Analysis
  • Necessary to have a list of transactions and
    what was purchased in each one.
  • Example:
  • Transaction 1: Frozen pizza, cola, milk
  • Transaction 2: Milk, potato chips
  • Transaction 3: Cola, frozen pizza
  • Transaction 4: Milk, pretzels
  • Transaction 5: Cola, pretzels

40
Association Rules
41
Association Rules
  • Measures of Association
  • Support
  • The percentage of baskets in the analysis where
    the rule is true, that is, where both the
    left-hand side and the right-hand side of the
    association are found.
  • Confidence
  • The percentage of baskets containing the
    left-hand side item that also contain the
    right-hand side item. Unlike support, confidence
    is the probability that the right-hand side item
    is present given that the left-hand side item is
    in the basket.
  • Calculated as a ratio:
  • (frequency of A and B) / (frequency of A)
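
Both measures are straightforward to compute from a list of baskets. The sketch below uses the five example transactions from the market basket slide; the function names and the lower-cased item strings are choices made here, not part of the presentation.

```python
# The five example transactions, one set of items per basket.
transactions = [
    {"frozen pizza", "cola", "milk"},
    {"milk", "potato chips"},
    {"cola", "frozen pizza"},
    {"milk", "pretzels"},
    {"cola", "pretzels"},
]

def support(items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in transactions) / len(transactions)

def confidence(lhs, rhs):
    """(frequency of A and B) / (frequency of A) for the rule lhs IMPLIES rhs."""
    return support(set(lhs) | set(rhs)) / support(lhs)

print(f"support(cola, frozen pizza)       = {support({'cola', 'frozen pizza'}):.0%}")
print(f"support(milk)                     = {support({'milk'}):.0%}")
print(f"confidence(milk -> potato chips)  = {confidence({'milk'}, {'potato chips'}):.0%}")
print(f"confidence(cola -> frozen pizza)  = {confidence({'cola'}, {'frozen pizza'}):.0%}")
print(f"confidence(frozen pizza -> cola)  = {confidence({'frozen pizza'}, {'cola'}):.0%}")
```

Support is the same whichever side of the rule an item sits on, whereas the last two lines show that confidence depends on the direction of the rule.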

42
Association Rules
  • Measures of Association
  • The support measure
  • for the rule
  • Cola IMPLIES Frozen Pizza is 40%
  • Frozen Pizza IMPLIES Cola is 40%
  • for the single item
  • Milk is 60%
  • (Note: support considers only the combination and
    not the direction.)

43
Association Rules
  • Measures of Association
  • Confidence
  • Milk IMPLIES Potato Chips has confidence
  • (frequency of A and B) / (frequency of A)
  • = 20% / 60%
  • ≈ 33%
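
Written out with the basket counts from the five example transactions (milk appears in three baskets, milk together with potato chips in one), the same calculation is:

```latex
\[
\mathrm{confidence}(\text{Milk} \Rightarrow \text{Potato Chips})
  = \frac{\mathrm{support}(\{\text{Milk}, \text{Potato Chips}\})}
         {\mathrm{support}(\{\text{Milk}\})}
  = \frac{1/5}{3/5}
  = \frac{1}{3}
  \approx 33\%
\]
```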

44
Data Mining vs. KDD
  • Knowledge Discovery in Databases (KDD): the
    process of finding useful information and
    patterns in data.
  • Data Mining: the use of algorithms to extract the
    information and patterns derived by the KDD
    process.

45
KDD Process
Modified from FPSS96C
  • Selection (Pre-Mining 1): Obtain data from
    various sources.
  • Preprocessing (Pre-Mining 2): Cleanse data.
  • Transformation (Pre-Mining 3): Convert to common
    format. Transform to new format.
  • Data Mining: Obtain desired results.
  • Interpretation/Evaluation (Post-Mining): Present
    results to user in meaningful manner.

46
KDD Process Example: Web Log
  • Selection: Select log data (dates and locations)
    to use.
  • Preprocessing: Remove identifying URLs. Remove
    error logs.
  • Transformation: Sessionize (sort and group).
  • Data Mining: Identify and count patterns.
    Construct data structure.
  • Interpretation/Evaluation: Identify and display
    frequently accessed sequences.
  • Potential User Applications: Cache prediction.
    Personalisation.
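
The web log steps can be strung together as a small pipeline. Everything concrete below is an assumption made for illustration: the log records, the time window, one session per user, treating HTTP status 400 and above as an error log, and counting consecutive page pairs as the "patterns". The "remove identifying URLs" step is omitted because the toy data has none.

```python
from collections import Counter
from itertools import groupby

# Hypothetical log records: (user, timestamp, url, http_status).
log = [
    ("u1", 1, "/home", 200), ("u1", 2, "/products", 200), ("u1", 3, "/cart", 200),
    ("u2", 1, "/home", 200), ("u2", 2, "/products", 200), ("u2", 3, "/missing", 404),
    ("u3", 5, "/home", 200), ("u3", 6, "/products", 200), ("u3", 7, "/cart", 200),
    ("u4", 9, "/home", 200),
]

# Selection: keep only the time window of interest.
selected = [r for r in log if 1 <= r[1] <= 7]

# Preprocessing: remove error logs.
clean = [r for r in selected if r[3] < 400]

# Transformation: sessionize (sort, then group the page views per user).
clean.sort(key=lambda r: (r[0], r[1]))
sessions = [[url for _, _, url, _ in group]
            for _, group in groupby(clean, key=lambda r: r[0])]

# Data mining: count how often one page directly follows another.
pair_counts = Counter(pair for s in sessions for pair in zip(s, s[1:]))

# Interpretation/evaluation: display the frequently accessed sequences.
frequent = [(pair, n) for pair, n in pair_counts.most_common() if n >= 2]
for (src, dst), n in frequent:
    print(f"{src} -> {dst}: {n} sessions")
```

A result such as "/home is usually followed by /products" is the kind of pattern that the cache prediction and personalisation applications would consume.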

47
Data Mining Development
  • Similarity Measures
  • Hierarchical Clustering
  • IR Systems
  • Imprecise Queries
  • Textual Data
  • Web Search Engines
  • Relational Data Model
  • SQL
  • Association Rule Algorithms
  • Data Warehousing
  • Scalability Techniques
  • Bayes Theorem
  • Regression Analysis
  • EM Algorithm
  • K-Means Clustering
  • Time Series Analysis
  • Algorithm Design Techniques
  • Algorithm Analysis
  • Data Structures
  • Neural Networks
  • Decision Tree Algorithms

48
Data Mining: What It Can't Do
  • tell the value of the patterns to the
    organization
  • replace skilled business analysts or managers
  • automatically discover solutions without guidance