Introduction to Data Mining

Donghui Zhang, CCIS, Northeastern University. http://www.cs.uiuc.edu/~hanj The current talk was extracted and modified from Dr. Han ...

1
Introduction to Data Mining

Donghui Zhang
CCIS, Northeastern University

2
http://www.cs.uiuc.edu/~hanj

This talk was extracted and modified
from Dr. Han's lecture slides.

3
Motivation

Data explosion problem
Automated data collection tools and mature
database technology lead to tremendous amounts of
data accumulated and/or to be analyzed in
databases, data warehouses, and other information
repositories
We are drowning in data, but starving for
knowledge!
Solution: data warehousing and data mining
Data warehousing and on-line analytical
processing
Mining interesting knowledge (rules,
regularities, patterns, constraints) from data in
large databases

4
Evolution of Database Technology

1960s
Data collection, database creation, IMS and
network DBMS
1970s
Relational data model, relational DBMS
implementation
1980s
RDBMS, advanced data models (extended-relational,
OO, deductive, etc.)
Application-oriented DBMS (spatial, scientific,
engineering, etc.)
1990s
Data mining, data warehousing, multimedia
databases, and Web databases
2000s
Stream data management and mining
Data mining with a variety of applications
Web technology and global information systems

5
Data Mining: Confluence of Multiple Disciplines
Database Systems
Statistics
Data Mining
Machine Learning
Visualization
Algorithm
Other Disciplines
6
What Is Data Mining?

Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit,
previously unknown and potentially useful)
patterns or knowledge from huge amounts of data
Data mining: a misnomer?
Alternative names
Knowledge discovery (mining) in databases (KDD),
knowledge extraction, data/pattern analysis, data
archeology, data dredging, information
harvesting, business intelligence, etc.
Watch out: is everything data mining?
(Deductive) query processing.
Expert systems or small ML/statistical programs

7
Why Data Mining? Potential Applications

Data analysis and decision support
Market analysis and management
Target marketing, customer relationship
management (CRM), market basket analysis, cross
selling, market segmentation
Risk analysis and management
Forecasting, customer retention, improved
underwriting, quality control, competitive
analysis
Fraud detection and detection of unusual patterns
(outliers)
Other Applications
Text mining (news group, email, documents) and
Web mining
Stream data mining
DNA and bio-data analysis

8
Data Mining: A KDD Process

Data mining: core of the knowledge discovery process

[Figure: the KDD process. Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation -> Knowledge]
9
Steps of a KDD Process

Learning the application domain
relevant prior knowledge and goals of application
Creating a target data set: data selection
Data cleaning and preprocessing (may take 60% of
effort!)
Data reduction and transformation
Find useful features, dimensionality/variable
reduction, invariant representation.
Choosing functions of data mining
summarization, classification, regression,
association, clustering.
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation
visualization, transformation, removing redundant
patterns, etc.
Use of discovered knowledge

10
Architecture: Typical Data Mining System
[Figure: layered architecture. From top to bottom: graphical user interface; pattern evaluation; data mining engine (with knowledge base); database or data warehouse server; data cleaning, data integration, and filtering; data warehouse and databases]
11
Data Mining: On What Kinds of Data?

Relational database
Data warehouse
Transactional database
Advanced database and information repository
Object-relational database
Spatial and temporal data
Time-series data
Stream data
Multimedia database
Heterogeneous and legacy database
Text databases and the WWW

12
Data Mining Functionalities

Concept description: characterization and
discrimination
Generalize, summarize, and contrast data
characteristics, e.g., dry vs. wet regions
Association (correlation and causality)
Diaper → Beer [support 0.5%, confidence 75%]
Classification and Prediction
Construct models (functions) that describe and
distinguish classes or concepts for future
prediction
E.g., classify countries based on climate, or
classify cars based on gas mileage
Presentation: decision tree, classification rules,
neural network
Predict some unknown or missing numerical values

13
Data Mining Functionalities (2)

Cluster analysis
Class label is unknown: group data to form new
classes, e.g., cluster houses to find
distribution patterns
Maximizing intra-class similarity and minimizing
inter-class similarity
Mining complex types of data

14
1. Concept Description

Descriptive vs. predictive data mining
Descriptive mining: describes concepts or
task-relevant data sets in concise, summarative,
informative, discriminative forms
Predictive mining: based on data and analysis,
constructs models for the database and predicts
the trend and properties of unknown data
Concept description
Characterization: provides a concise and succinct
summarization of the given collection of data
Comparison: provides descriptions comparing two
or more collections of data

15
Class Characterization: An Example
[Figure: an initial relation generalized into a prime generalized relation; the table contents are not recoverable]
16
2. Frequent Patterns and Association Rules

Itemset X = {x1, ..., xk}
Find all the rules X → Y with minimum confidence and
support
support, s: probability that a transaction
contains X ∪ Y
confidence, c: conditional probability that a
transaction having X also contains Y

Transaction-id   Items bought
10               A, B, C
20               A, C
30               A, D
40               B, E, F

Let min_support = 50%, min_conf = 50%:
A → C (50%, 66.7%)
C → A (50%, 100%)
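
The two figures for each rule can be checked directly from the four transactions. The following is a minimal sketch in Python; the function names `support` and `confidence` are chosen here for illustration and do not come from the slides.

```python
# Transactions from the example table on this slide.
transactions = {
    10: {"A", "B", "C"},
    20: {"A", "C"},
    30: {"A", "D"},
    40: {"B", "E", "F"},
}


def support(itemset, db):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= items for items in db.values()) / len(db)


def confidence(lhs, rhs, db):
    """Conditional probability that a transaction with lhs also has rhs."""
    return support(set(lhs) | set(rhs), db) / support(lhs, db)


print(support({"A", "C"}, transactions))                 # 0.5
print(round(confidence({"A"}, {"C"}, transactions), 3))  # 0.667
print(confidence({"C"}, {"A"}, transactions))            # 1.0
```

Both rules share the same support because support is computed on the union of the two sides; only the confidence depends on the direction of the rule.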
17
Apriori: A Candidate Generation-and-Test Approach

Any subset of a frequent itemset must be frequent
If {beer, diaper, nuts} is frequent, so is {beer,
diaper}
Every transaction having {beer, diaper, nuts}
also contains {beer, diaper}
Apriori pruning principle: if there is any
itemset which is infrequent, its supersets should
not be generated/tested!
Method
generate length-(k+1) candidate itemsets from
length-k frequent itemsets, and
test the candidates against the DB
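
The generation step with pruning might look like the sketch below. It is an illustration under assumptions, not the slides' own pseudocode: the name `apriori_gen`, the frozenset representation, and the extra item `milk` are all introduced here.

```python
from itertools import combinations


def apriori_gen(frequent_k):
    """Build length-(k+1) candidates from the frequent k-itemsets.

    frequent_k: a set of frozensets, all of the same size k.
    """
    frequent_k = set(frequent_k)
    candidates = set()
    for a in frequent_k:
        for b in frequent_k:
            union = a | b
            if len(union) != len(a) + 1:
                continue
            # Apriori pruning: every k-subset must itself be frequent.
            if all(frozenset(s) in frequent_k
                   for s in combinations(union, len(a))):
                candidates.add(union)
    return candidates


# Hypothetical frequent 2-itemsets; "milk" is an illustrative extra item.
frequent_2 = {
    frozenset(pair)
    for pair in [("beer", "diaper"), ("beer", "nuts"),
                 ("diaper", "nuts"), ("beer", "milk")]
}
print(apriori_gen(frequent_2))
```

Only {beer, diaper, nuts} survives. Candidates containing `milk` are dropped without touching the database, because {diaper, milk} and {nuts, milk} are not frequent.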

18
The Apriori Algorithm: An Example

Database TDB
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan: C1
Itemset   sup
{A}       2
{B}       3
{C}       3
{D}       1
{E}       3

L1
Itemset   sup
{A}       2
{B}       3
{C}       3
{E}       3

C2 (generated from L1)
Itemset
{A, B}
{A, C}
{A, E}
{B, C}
{B, E}
{C, E}

2nd scan: C2 with support counts
Itemset   sup
{A, B}    1
{A, C}    2
{A, E}    1
{B, C}    2
{B, E}    3
{C, E}    2

L2
Itemset   sup
{A, C}    2
{B, C}    2
{B, E}    3
{C, E}    2

C3 (generated from L2)
Itemset
{B, C, E}

3rd scan: L3
Itemset     sup
{B, C, E}   2
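
The level-wise scans in this example can be reproduced with a short Python sketch. The minimum support count of 2 is an assumption inferred from the tables (D, with count 1, is dropped), and the function name `apriori` is illustrative.

```python
from itertools import combinations

# Database TDB from the example.
TDB = {
    10: {"A", "C", "D"},
    20: {"B", "C", "E"},
    30: {"A", "B", "C", "E"},
    40: {"B", "E"},
}


def apriori(db, min_sup):
    """Return {itemset: support count} for itemsets with count >= min_sup."""
    def count(cands):
        # One database scan: count the transactions containing each candidate.
        return {c: sum(c <= t for t in db.values()) for c in cands}

    def frequent(cands):
        return {c: n for c, n in count(cands).items() if n >= min_sup}

    # 1st scan: C1 is every single item; L1 keeps the frequent ones.
    items = {i for t in db.values() for i in t}
    level = frequent({frozenset([i]) for i in items})
    result = dict(level)
    k = 1
    while level:
        # Join frequent k-itemsets into (k+1)-candidates ...
        cands = {a | b for a in level for b in level if len(a | b) == k + 1}
        # ... and prune any candidate that has an infrequent k-subset.
        cands = {c for c in cands
                 if all(frozenset(s) in level for s in combinations(c, k))}
        level = frequent(cands)
        result.update(level)
        k += 1
    return result


freq = apriori(TDB, min_sup=2)
for itemset, n in sorted(freq.items(),
                         key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

The run yields the four itemsets of L1, the four of L2, and {B, C, E} with count 2. {A, B, C} and {A, C, E} never reach the third scan because {A, B} and {A, E} are infrequent.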
19
Sequential Pattern Mining

Given a set of sequences, find the complete set
of frequent subsequences

A sequence: <(ef)(ab)(df)cb>
A sequence database
An element may contain a set of items. Items
within an element are unordered and we list them
alphabetically.
SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>
<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
Given support threshold min_sup = 2, <(ab)c> is a
sequential pattern
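
The subsequence test and the support count for this database can be sketched as follows. The string encoding (single-letter items, parentheses around multi-item elements) and the helper names `parse`, `is_subsequence`, and `sequence_support` are assumptions made for this illustration.

```python
def parse(seq):
    """Parse a string such as 'a(abc)(ac)d(cf)' into a list of item sets."""
    elements, i = [], 0
    while i < len(seq):
        if seq[i] == "(":
            j = seq.index(")", i)
            elements.append(set(seq[i + 1:j]))
            i = j + 1
        else:
            elements.append({seq[i]})
            i += 1
    return elements


def is_subsequence(pattern, sequence):
    """Match each pattern element, in order, to the earliest later
    sequence element that contains it as a subset."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


database = {
    10: "a(abc)(ac)d(cf)",
    20: "(ad)c(bc)(ae)",
    30: "(ef)(ab)(df)cb",
    40: "eg(af)cbc",
}


def sequence_support(pattern, db):
    """Number of sequences in db that contain pattern as a subsequence."""
    p = parse(pattern)
    return sum(is_subsequence(p, parse(s)) for s in db.values())


print(is_subsequence(parse("a(bc)dc"), parse(database[10])))  # True
print(sequence_support("(ab)c", database))                    # 2
```

<(ab)c> is contained in sequences 10 and 30 only, which gives the support of 2 that makes it a sequential pattern at min_sup = 2.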
20
3. Classification and Prediction

Classification
predicts categorical class labels (discrete or
nominal)
classifies data (constructs a model) based on the
training set and the values (class labels) in a
classifying attribute and uses it in classifying
new data
Prediction
models continuous-valued functions, i.e.,
predicts unknown or missing values
Typical Applications
credit approval
target marketing
medical diagnosis
treatment effectiveness analysis

21
Training Dataset
This follows an example from Quinlan's ID3
22
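Output: A Decision Tree for buys_computer
[Figure: decision tree. The root tests age? with branches <30, 30..40, and >40. Two internal nodes test student? (no / yes) and credit rating? (fair / excellent). Leaves are labeled yes or no.]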
23
Algorithm for Decision Tree Induction

Basic algorithm (a greedy algorithm)
Tree is constructed in a top-down recursive
divide-and-conquer manner
At start, all the training examples are at the
root
Attributes are categorical (if continuous-valued,
they are discretized in advance)
Examples are partitioned recursively based on
selected attributes
Test attributes are selected on the basis of a
heuristic or statistical measure (e.g.,
information gain)
Conditions for stopping partitioning
All samples for a given node belong to the same
class
There are no remaining attributes for further
partitioning; majority voting is employed for
classifying the leaf
There are no samples left
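
A compact sketch of this greedy, top-down procedure with information gain as the selection measure is given below. The training table from the slides is not part of this transcript, so the four rows used here are hypothetical, as are the names `entropy`, `info_gain`, and `build_tree`.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())


def info_gain(rows, labels, attr):
    """Entropy reduction obtained by partitioning the samples on attr."""
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [l for r, l in zip(rows, labels) if r[attr] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder


def build_tree(rows, labels, attrs):
    # Stop: all samples at this node belong to the same class.
    if len(set(labels)) == 1:
        return labels[0]
    # Stop: no attributes left, so label the leaf by majority vote.
    if not attrs:
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    tree = {best: {}}
    # Branch only on values seen at this node, so the "no samples left"
    # case does not arise in this sketch.
    for value in {r[best] for r in rows}:
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        tree[best][value] = build_tree(
            [rows[i] for i in idx],
            [labels[i] for i in idx],
            [a for a in attrs if a != best],
        )
    return tree


# Hypothetical categorical training data (not the slides' table).
rows = [
    {"student": "no", "credit": "fair"},
    {"student": "no", "credit": "excellent"},
    {"student": "yes", "credit": "fair"},
    {"student": "yes", "credit": "excellent"},
]
labels = ["no", "no", "yes", "yes"]
print(build_tree(rows, labels, ["student", "credit"]))
```

On this toy data `student` has an information gain of 1 bit and `credit` has none, so the tree splits once on `student` and stops at two pure leaves.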

24
Other Classification Techniques

Classification by decision tree induction
Bayesian Classification
Classification by Neural Networks
Classification by Support Vector Machines (SVM)
Classification based on concepts from association
rule mining

25
4. Cluster Analysis

Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Cluster analysis
Grouping a set of data objects into clusters
Clustering is unsupervised classification: no
predefined classes
Typical applications
As a stand-alone tool to get insight into data
distribution
As a preprocessing step for other algorithms

26
What Is Good Clustering?

A good clustering method will produce high
quality clusters with
high intra-class similarity
low inter-class similarity
The quality of a clustering result depends on
both the similarity measure used by the method
and its implementation.
The quality of a clustering method is also
measured by its ability to discover some or all
of the hidden patterns.

27
Major Clustering Approaches

Partitioning algorithms: construct various
partitions and then evaluate them by some
criterion
Hierarchical algorithms: create a hierarchical
decomposition of the set of data (or objects)
using some criterion
Density-based: based on connectivity and density
functions
Grid-based: based on a multiple-level granularity
structure
Model-based: a model is hypothesized for each of
the clusters and the idea is to find the best fit
of the data to that model

28
The K-Means Partitioning Algorithm

Given k, the k-means algorithm is implemented in
four steps
1. Partition objects into k nonempty subsets
2. Compute seed points as the centroids of the
clusters of the current partition (the centroid
is the center, i.e., mean point, of the cluster)
3. Assign each object to the cluster with the
nearest seed point
4. Go back to Step 2; stop when there are no more new
assignments
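
The four steps map onto code almost line by line. The sketch below is one possible reading, with assumptions stated: the initial partition is round-robin, distance is squared Euclidean, a cluster that becomes empty keeps its previous centroid, and the sample points are made up.

```python
def kmeans(points, k, max_iter=100):
    """k-means over a list of equal-length numeric tuples (len >= k)."""
    # Step 1: partition the objects into k nonempty subsets (round-robin).
    assign = [i % k for i in range(len(points))]
    centroids = [None] * k
    for _ in range(max_iter):
        # Step 2: seed points are the centroids (mean points) of the clusters.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = tuple(sum(x) / len(members)
                                     for x in zip(*members))
        # Step 3: assign each object to the cluster with the nearest seed.
        new_assign = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2
                                  for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Step 4: repeat from Step 2 until no assignment changes.
        if new_assign == assign:
            break
        assign = new_assign
    return assign, centroids


# Hypothetical 2-D points forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
assign, centroids = kmeans(points, k=2)
print(assign)  # [0, 0, 0, 1, 1, 1]
```

The result depends on the initial partition, so a different Step 1 can lead to a different local optimum on less clearly separated data.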

29
5. Mining Complex Types of Data

Mining spatial databases
Mining multimedia databases
Mining time-series and sequence data
Mining stream data
Mining text databases
Mining the World-Wide Web

30
E.g., Mining Time-Series: Two Tasks
Time-series plot
31
Task One: Trend Analysis

Predict whether increase or decrease
Long-term or trend movements (trend curve)
Cyclic movements or cycle variations, e.g.,
business cycles
Seasonal movements or seasonal variations
i.e., almost identical patterns that a time series
appears to follow during corresponding months of
successive years.
Irregular or random movements
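
The slide does not name a method for estimating the trend curve. One common choice, used here as an assumption, is a moving average, which smooths out short-term and irregular movements so that the long-term direction is easier to read. The series values below are invented.

```python
def moving_average(series, window):
    """Mean of each run of `window` consecutive values."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]


def direction(trend):
    """Crude increase/decrease call from the first and last trend values."""
    if trend[-1] > trend[0]:
        return "increase"
    if trend[-1] < trend[0]:
        return "decrease"
    return "flat"


series = [3, 5, 4, 6, 8, 7, 9, 11]  # hypothetical observations
trend = moving_average(series, window=3)
print(trend)             # [4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
print(direction(trend))  # increase
```

A wider window removes more of the cyclic and seasonal variation but also shortens the smoothed series and delays its reaction to real changes.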

32
Task Two: Similarity Search

Normal database query finds exact match
Similarity search finds data sequences that
differ only slightly from the given query
sequence
Two categories of similarity queries
find a sequence that is similar to the query
sequence
find all pairs of similar sequences
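
Both query types can be sketched once a distance and a tolerance are fixed. Euclidean distance over equal-length sequences and the threshold `eps` are assumptions made for this example; the slides do not specify either, and the sequences are invented.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def similar_to(query, sequences, eps):
    """Names of stored sequences within distance eps of the query."""
    return [name for name, s in sequences.items() if dist(query, s) <= eps]


def similar_pairs(sequences, eps):
    """All pairs of stored sequences within distance eps of each other."""
    return [(a, b) for a, b in combinations(sorted(sequences), 2)
            if dist(sequences[a], sequences[b]) <= eps]


# Hypothetical equal-length sequences.
sequences = {
    "s1": [1, 2, 3, 4],
    "s2": [1, 2, 3, 5],
    "s3": [9, 8, 7, 6],
}
print(similar_to([1, 2, 3, 4.5], sequences, eps=1.0))  # ['s1', 's2']
print(similar_pairs(sequences, eps=1.0))               # [('s1', 's2')]
```

The first function answers the "similar to the query" category and the second the "all pairs" category; the all-pairs version compares every pair and so grows quadratically with the number of sequences.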

33
Data Warehouse
34
What Is a Data Warehouse?

Defined in many different ways, but not
rigorously.
A decision support database that is maintained
separately from the organization's operational
database
Support information processing by providing a
solid platform of consolidated, historical data
for analysis.
"A data warehouse is a subject-oriented,
integrated, time-variant, and nonvolatile
collection of data in support of management's
decision-making process." (W. H. Inmon)
Data warehousing
The process of constructing and using data
warehouses

35
Conceptual Modeling of Data Warehouses

Modeling data warehouses: dimensions and measures
Star schema: a fact table in the middle connected
to a set of dimension tables
Snowflake schema: a refinement of the star schema
where some dimensional hierarchy is normalized
into a set of smaller dimension tables, forming a
shape similar to a snowflake
Fact constellations: multiple fact tables share
dimension tables, viewed as a collection of
stars, therefore called a galaxy schema or fact
constellation
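
A star schema can be tried out in a few lines with Python's built-in `sqlite3` module. This is a sketch under assumptions: the fact table keeps only a subset of the keys and measures, and the dimension tables `item` and `location`, their columns, and all rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables (hypothetical columns).
CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY,
                       city TEXT, country TEXT);
-- Fact table in the middle: foreign keys plus measures.
CREATE TABLE sales (
    item_key INTEGER REFERENCES item(item_key),
    location_key INTEGER REFERENCES location(location_key),
    units_sold INTEGER,
    dollars_sold REAL
);
INSERT INTO item VALUES (1, 'laptop'), (2, 'phone');
INSERT INTO location VALUES (1, 'Boston', 'USA'), (2, 'Toronto', 'Canada');
INSERT INTO sales VALUES (1, 1, 2, 2000.0), (2, 1, 1, 500.0),
                         (1, 2, 3, 3000.0);
""")

# Aggregate a measure along one dimension: join the fact table to it.
rows = conn.execute("""
    SELECT l.country, SUM(s.dollars_sold)
    FROM sales AS s JOIN location AS l ON s.location_key = l.location_key
    GROUP BY l.country
    ORDER BY l.country
""").fetchall()
print(rows)  # [('Canada', 3000.0), ('USA', 2500.0)]
```

In a snowflake variant, `country` would move out of `location` into its own table referenced by a key, which costs one more join per query in exchange for less redundancy.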

36
Example of Star Schema

Sales Fact Table
Keys: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales
[Figure: the surrounding dimension tables are not recoverable]
37
Example of Snowflake Schema
Sales Fact Table
Keys: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales
[Figure: the normalized dimension tables are not recoverable]
38
Example of Fact Constellation
Sales Fact Table
Keys: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table
Keys: time_key, item_key, shipper_key, from_location, to_location
Measures: dollars_cost, units_shipped
[Figure: the shared dimension tables are not recoverable]
39
Multidimensional Data

Sales volume as a function of product, month, and
region

Dimensions: Product, Location, Time
Hierarchical summarization paths
Product: Industry > Category > Product
Location: Region > Country > City > Office
Time: Year > Quarter > Month or Week > Day
[Figure: data cube with axes Product, Month, and Region]
40
Cuboids and Cube
0-D (apex) cuboid: all
1-D cuboids: product; month; region
2-D cuboids: (product, month); (product, region); (month, region)
3-D (base) cuboid: (product, month, region)
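
Each cuboid is a group-by over one subset of the dimensions, so three dimensions give 2^3 = 8 cuboids. The sketch below enumerates them with `itertools.combinations`; the fact rows and the `sales` measure are invented for the example.

```python
from collections import defaultdict
from itertools import combinations

dimensions = ("product", "month", "region")

# Hypothetical base facts.
facts = [
    {"product": "TV", "month": "Jan", "region": "East", "sales": 10},
    {"product": "TV", "month": "Feb", "region": "East", "sales": 20},
    {"product": "PC", "month": "Jan", "region": "West", "sales": 5},
]


def cuboid(rows, dims):
    """Total sales grouped by the given subset of dimensions."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row["sales"]
    return dict(totals)


# One cuboid per subset of dimensions, from the apex () to the base.
cube = {
    dims: cuboid(facts, dims)
    for n in range(len(dimensions) + 1)
    for dims in combinations(dimensions, n)
}

print(len(cube))            # 8
print(cube[()])             # {(): 35}
print(cube[("product",)])   # {('TV',): 30, ('PC',): 5}
```

The apex cuboid holds the single grand total, and the base cuboid keeps one cell per distinct (product, month, region) combination.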
41
OLAP Server Architectures

Relational OLAP (ROLAP)
Use relational or extended-relational DBMS to
store and manage warehouse data and OLAP
middleware to support missing pieces
Include optimization of DBMS backend,
implementation of aggregation navigation logic,
and additional tools and services
greater scalability
Multidimensional OLAP (MOLAP)
Array-based multidimensional storage engine
(sparse matrix techniques)
fast indexing to pre-computed summarized data
Hybrid OLAP (HOLAP)
User flexibility, e.g., low level relational,
high-level array
Specialized SQL servers
specialized support for SQL queries over
star/snowflake schemas

42
Data Warehouse Back-End Tools and Utilities

Data extraction
get data from multiple, heterogeneous, and
external sources
Data cleaning
detect errors in the data and rectify them when
possible
Data transformation
convert data from legacy or host format to
warehouse format
Load
sort, summarize, consolidate, compute views,
check integrity, and build indices and
partitions
Refresh
propagate the updates from the data sources to
the warehouse

43
Summary

Data mining: discovering interesting patterns
from large amounts of data
A natural evolution of database technology, in
great demand, with wide applications
Data mining functionalities: characterization,
association, classification, clustering, mining
complex data, etc.
Data warehousing

44
Where to Find Data Mining Papers