Title: Introduction to Data Mining
1 Introduction to Data Mining
- Donghui Zhang
- CCIS, Northeastern University
2 http://www.cs.uiuc.edu/hanj
- The slides for this talk were extracted and modified
from Dr. Han's lecture slides.
3 Motivation
- Data explosion problem
- Automated data collection tools and mature
database technology lead to tremendous amounts of
data accumulated and/or to be analyzed in
databases, data warehouses, and other information
repositories
- We are drowning in data, but starving for
knowledge!
- Solution: Data warehousing and data mining
- Data warehousing and on-line analytical
processing
- Mining interesting knowledge (rules,
regularities, patterns, constraints) from data in
large databases
4 Evolution of Database Technology
- 1960s
- Data collection, database creation, IMS and
network DBMS
- 1970s
- Relational data model, relational DBMS
implementation
- 1980s
- RDBMS, advanced data models (extended-relational,
OO, deductive, etc.)
- Application-oriented DBMS (spatial, scientific,
engineering, etc.)
- 1990s
- Data mining, data warehousing, multimedia
databases, and Web databases
- 2000s
- Stream data management and mining
- Data mining with a variety of applications
- Web technology and global information systems
5 Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining shown as the meeting point of
Database Systems, Statistics, Machine Learning,
Visualization, Algorithm, and Other Disciplines]
6 What Is Data Mining?
- Data mining (knowledge discovery from data)
- Extraction of interesting (non-trivial, implicit,
previously unknown and potentially useful)
patterns or knowledge from huge amounts of data
- Data mining: a misnomer?
- Alternative names
- Knowledge discovery (mining) in databases (KDD),
knowledge extraction, data/pattern analysis, data
archeology, data dredging, information
harvesting, business intelligence, etc.
- Watch out: Is everything "data mining"?
- (Deductive) query processing.
- Expert systems or small ML/statistical programs
7 Why Data Mining? Potential Applications
- Data analysis and decision support
- Market analysis and management
- Target marketing, customer relationship
management (CRM), market basket analysis, cross
selling, market segmentation
- Risk analysis and management
- Forecasting, customer retention, improved
underwriting, quality control, competitive
analysis
- Fraud detection and detection of unusual patterns
(outliers)
- Other Applications
- Text mining (news group, email, documents) and
Web mining
- Stream data mining
- DNA and bio-data analysis
8 Data Mining: A KDD Process
- Data mining: core of knowledge discovery process
[Figure: KDD pipeline: Databases → Data Cleaning and
Data Integration → Data Warehouse → Selection →
Task-relevant Data → Data Mining → Pattern
Evaluation → Knowledge]
9 Steps of a KDD Process
- Learning the application domain
- relevant prior knowledge and goals of application
- Creating a target data set: data selection
- Data cleaning and preprocessing (may take 60% of
effort!)
- Data reduction and transformation
- Find useful features, dimensionality/variable
reduction, invariant representation.
- Choosing functions of data mining
- summarization, classification, regression,
association, clustering.
- Choosing the mining algorithm(s)
- Data mining: search for patterns of interest
- Pattern evaluation and knowledge presentation
- visualization, transformation, removing redundant
patterns, etc.
- Use of discovered knowledge
10 Architecture: Typical Data Mining System
[Figure: architecture diagram with these components:
Graphical user interface; Pattern evaluation; Data
mining engine; Knowledge-base; Database or data
warehouse server; Filtering, data cleaning, and data
integration; Data Warehouse; Databases]
11 Data Mining: On What Kinds of Data?
- Relational database
- Data warehouse
- Transactional database
- Advanced database and information repository
- Object-relational database
- Spatial and temporal data
- Time-series data
- Stream data
- Multimedia database
- Heterogeneous and legacy database
- Text databases and WWW
12 Data Mining Functionalities
- Concept description: Characterization and
discrimination
- Generalize, summarize, and contrast data
characteristics, e.g., dry vs. wet regions
- Association (correlation and causality)
- Diaper → Beer [support 0.5%, confidence 75%]
- Classification and Prediction
- Construct models (functions) that describe and
distinguish classes or concepts for future
prediction
- E.g., classify countries based on climate, or
classify cars based on gas mileage
- Presentation: decision-tree, classification rule,
neural network
- Predict some unknown or missing numerical values
13 Data Mining Functionalities (2)
- Cluster analysis
- Class label is unknown: Group data to form new
classes, e.g., cluster houses to find
distribution patterns
- Maximizing intra-class similarity and minimizing
interclass similarity
- Mining complex types of data
14 1. Concept Description
- Descriptive vs. predictive data mining
- Descriptive mining: describes concepts or
task-relevant data sets in concise, summarized,
informative, discriminative forms
- Predictive mining: Based on data and analysis,
constructs models for the database, and predicts
the trend and properties of unknown data
- Concept description
- Characterization: provides a concise and succinct
summarization of the given collection of data
- Comparison: provides descriptions comparing two
or more collections of data
15 Class Characterization: An Example
[Tables: Initial Relation; Prime Generalized Relation]
16 2. Frequent Patterns and Association Rules
- Itemset X = {x1, …, xk}
- Find all the rules X → Y with min confidence and
support
- support, s: probability that a transaction
contains X ∪ Y
- confidence, c: conditional probability that a
transaction having X also contains Y.
Transaction-id   Items bought
10               A, B, C
20               A, C
30               A, D
40               B, E, F
Let min_support = 50%, min_conf = 50%:
A → C (50%, 66.7%)
C → A (50%, 100%)
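The two rule measures above can be checked with a few lines of Python. This is a minimal sketch using the four transactions from the slide; the function names `support` and `confidence` are illustrative choices, not part of the original slides.

```python
# Transactions copied from the slide's table (transaction id -> items bought).
transactions = {
    10: {"A", "B", "C"},
    20: {"A", "C"},
    30: {"A", "D"},
    40: {"B", "E", "F"},
}

def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for items in db.values() if set(itemset) <= items)
    return hits / len(db)

def confidence(lhs, rhs, db):
    """Conditional probability that a transaction containing `lhs` also contains `rhs`."""
    return support(set(lhs) | set(rhs), db) / support(lhs, db)

# Report both measures for the two rules discussed on the slide.
for lhs, rhs in [({"A"}, {"C"}), ({"C"}, {"A"})]:
    s = support(lhs | rhs, transactions)
    c = confidence(lhs, rhs, transactions)
    print(f"{sorted(lhs)} -> {sorted(rhs)}: support={s:.1%}, confidence={c:.1%}")
```

Both rules share the same support because support depends only on the combined itemset, while confidence depends on which side is the antecedent.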
17 Apriori: A Candidate Generation-and-test Approach
- Any subset of a frequent itemset must be frequent
- if {beer, diaper, nuts} is frequent, so is {beer,
diaper}
- Every transaction having {beer, diaper, nuts}
also contains {beer, diaper}
- Apriori pruning principle: If there is any
itemset which is infrequent, its superset should
not be generated/tested!
- Method
- generate length (k+1) candidate itemsets from
length k frequent itemsets, and
- test the candidates against DB
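18 The Apriori Algorithm: An Example
(The pruning of {D}, {A, B}, and {A, E}, each with
support 1, implies a minimum support count of 2.)
Database TDB
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E
1st scan → C1
Itemset   sup
{A}       2
{B}       3
{C}       3
{D}       1
{E}       3
L1
Itemset   sup
{A}       2
{B}       3
{C}       3
{E}       3
C2 (candidates generated from L1)
Itemset
{A, B}
{A, C}
{A, E}
{B, C}
{B, E}
{C, E}
2nd scan → C2 with counts
Itemset   sup
{A, B}    1
{A, C}    2
{A, E}    1
{B, C}    2
{B, E}    3
{C, E}    2
L2
Itemset   sup
{A, C}    2
{B, C}    2
{B, E}    3
{C, E}    2
C3 (candidates generated from L2)
Itemset
{B, C, E}
3rd scan → L3
Itemset     sup
{B, C, E}   2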
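The scans in this example can be reproduced with a short Python sketch of the generate-and-test loop. It assumes a minimum support count of 2 (inferred from the itemsets the example drops) and uses a simple pairwise join for candidate generation; it is an illustration, not an optimized implementation.

```python
from itertools import combinations

# Database TDB from the example (Tid -> items).
tdb = {
    10: {"A", "C", "D"},
    20: {"B", "C", "E"},
    30: {"A", "B", "C", "E"},
    40: {"B", "E"},
}

def apriori(db, min_count):
    """Return {frozenset: support count} for every frequent itemset in `db`."""
    # 1st scan: count single items (C1) and keep the frequent ones (L1).
    counts = {}
    for items in db.values():
        for item in items:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(frequent)
    k = 1
    while frequent:
        prev = set(frequent)
        candidates = set()
        # Join pairs of frequent k-itemsets into (k+1)-itemsets.
        for a in prev:
            for b in prev:
                union = a | b
                if len(union) == k + 1:
                    # Apriori pruning: every k-subset must itself be frequent.
                    if all(frozenset(sub) in prev for sub in combinations(union, k)):
                        candidates.add(union)
        # Scan the database to count each surviving candidate.
        counts = {c: sum(1 for items in db.values() if c <= items) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_count}
        result.update(frequent)
        k += 1
    return result

frequent_itemsets = apriori(tdb, min_count=2)
for itemset, count in sorted(frequent_itemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The pruning check is what keeps {A, B, C} and {A, C, E} out of C3: each contains a 2-itemset that was not frequent after the second scan.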
19 Sequential Pattern Mining
- Given a set of sequences, find the complete set
of frequent subsequences
A sequence: <(ef)(ab)(df)cb>
A sequence database
An element may contain a set of items. Items
within an element are unordered and we list them
alphabetically.
SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>
<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
Given support threshold min_sup = 2, <(ab)c> is a
sequential pattern
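The subsequence test and the support count in this example can be sketched in Python. Each sequence is modelled here as a list of item sets, one set per element; the helper `seq` and the function names are hypothetical conveniences introduced only for this illustration.

```python
def seq(*elements):
    """Hypothetical helper: seq("a", "abc") builds [{'a'}, {'a', 'b', 'c'}]."""
    return [set(e) for e in elements]

# The sequence database from the slide.
database = {
    10: seq("a", "abc", "ac", "d", "cf"),
    20: seq("ad", "c", "bc", "ae"),
    30: seq("ef", "ab", "df", "c", "b"),
    40: seq("e", "g", "af", "c", "b", "c"),
}

def is_subsequence(pattern, sequence):
    """True if the pattern's elements are contained, in order, in distinct elements of `sequence`."""
    pos = 0
    for element in pattern:
        # Greedily advance to the earliest remaining element that contains this one.
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern element must match a strictly later element
    return True

def support(pattern, db):
    """Number of sequences in `db` that contain `pattern` as a subsequence."""
    return sum(1 for s in db.values() if is_subsequence(pattern, s))

print(is_subsequence(seq("a", "bc", "d", "c"), database[10]))
print(support(seq("ab", "c"), database))
```

Greedy matching is sufficient here because taking the earliest possible match for each pattern element never rules out a later match that another choice would have allowed.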
20 3. Classification and Prediction
- Classification
- predicts categorical class labels (discrete or
nominal)
- classifies data (constructs a model) based on the
training set and the values (class labels) in a
classifying attribute and uses it in classifying
new data
- Prediction
- models continuous-valued functions, i.e.,
predicts unknown or missing values
- Typical Applications
- credit approval
- target marketing
- medical diagnosis
- treatment effectiveness analysis
21 Training Dataset
This follows an example from Quinlan's ID3
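22 Output: A Decision Tree for buys_computer
[Figure: decision tree. Root node: age? with branches
<30, 30..40, and >40; internal nodes: student? (branches
yes / no) and credit rating? (branches fair /
excellent); leaves labelled yes or no]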
23 Algorithm for Decision Tree Induction
- Basic algorithm (a greedy algorithm)
- Tree is constructed in a top-down recursive
divide-and-conquer manner
- At start, all the training examples are at the
root
- Attributes are categorical (if continuous-valued,
they are discretized in advance)
- Examples are partitioned recursively based on
selected attributes
- Test attributes are selected on the basis of a
heuristic or statistical measure (e.g.,
information gain)
- Conditions for stopping partitioning
- All samples for a given node belong to the same
class
- There are no remaining attributes for further
partitioning; majority voting is employed for
classifying the leaf
- There are no samples left
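Attribute selection by information gain can be sketched as follows. The four training rows are a made-up toy set (the slide's training table is not reproduced in this transcript), and the attribute names are borrowed from the buys_computer tree only for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy reduction obtained by partitioning `rows` on `attribute`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

def best_attribute(rows, attributes, target):
    """Greedy choice of test attribute: the one with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))

# Hypothetical toy data, NOT the training set from the slides.
rows = [
    {"student": "yes", "credit_rating": "fair", "buys_computer": "yes"},
    {"student": "yes", "credit_rating": "excellent", "buys_computer": "yes"},
    {"student": "no", "credit_rating": "fair", "buys_computer": "no"},
    {"student": "no", "credit_rating": "excellent", "buys_computer": "no"},
]

print(best_attribute(rows, ["student", "credit_rating"], "buys_computer"))
```

In a full tree builder this selection would be applied recursively to each partition until one of the stopping conditions listed above holds.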
24 Other Classification Techniques
- Classification by decision tree induction
- Bayesian Classification
- Classification by Neural Networks
- Classification by Support Vector Machines (SVM)
- Classification based on concepts from association
rule mining
25 4. Cluster Analysis
- Cluster: a collection of data objects
- Similar to one another within the same cluster
- Dissimilar to the objects in other clusters
- Cluster analysis
- Grouping a set of data objects into clusters
- Clustering is unsupervised classification: no
predefined classes
- Typical applications
- As a stand-alone tool to get insight into data
distribution
- As a preprocessing step for other algorithms
26 What Is Good Clustering?
- A good clustering method will produce high
quality clusters with
- high intra-class similarity
- low inter-class similarity
- The quality of a clustering result depends on
both the similarity measure used by the method
and its implementation.
- The quality of a clustering method is also
measured by its ability to discover some or all
of the hidden patterns.
27 Major Clustering Approaches
- Partitioning algorithms: Construct various
partitions and then evaluate them by some
criterion
- Hierarchy algorithms: Create a hierarchical
decomposition of the set of data (or objects)
using some criterion
- Density-based: based on connectivity and density
functions
- Grid-based: based on a multiple-level granularity
structure
- Model-based: A model is hypothesized for each of
the clusters and the idea is to find the best fit
of the data to that model
28 The K-Means Partitioning Algorithm
- Given k, the k-means algorithm is implemented in
four steps:
- Step 1: Partition objects into k nonempty subsets
- Step 2: Compute seed points as the centroids of the
clusters of the current partition (the centroid
is the center, i.e., mean point, of the cluster)
- Step 3: Assign each object to the cluster with the
nearest seed point
- Step 4: Go back to Step 2; stop when there are no
new assignments
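The four steps can be sketched in plain Python. The round-robin initial partition, the squared Euclidean distance, and the sample points are assumptions made for this illustration; the slide does not prescribe any of them.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k):
    """Return (labels, centroids) after running the four steps until assignments stop changing."""
    if len(points) < k:
        raise ValueError("need at least k points")
    # Step 1: partition objects into k nonempty subsets (round-robin, an assumption).
    labels = [i % k for i in range(len(points))]
    centroids = [None] * k
    while True:
        # Step 2: compute each cluster's centroid (mean point).
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
        # Step 3: assign each object to the cluster with the nearest centroid.
        new_labels = [min(range(k), key=lambda c: dist2(p, centroids[c])) for p in points]
        # Step 4: stop when no assignment changed, otherwise repeat from Step 2.
        if new_labels == labels:
            return labels, centroids
        labels = new_labels

# Hypothetical sample: two visually separate groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

The result depends on the initial partition, so practical implementations usually try several starting partitions and keep the best one.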
29 5. Mining Complex Types of Data
- Mining spatial databases
- Mining multimedia databases
- Mining time-series and sequence data
- Mining stream data
- Mining text databases
- Mining the World-Wide Web
30 E.g. Mining Time-Series: two tasks
[Figure: time-series plot]
31 Task one: Trend analysis
- Predict whether increase or decrease
- Long-term or trend movements (trend curve)
- Cyclic movements or cycle variations, e.g.,
business cycles
- Seasonal movements or seasonal variations
- i.e., almost identical patterns that a time series
appears to follow during corresponding months of
successive years.
- Irregular or random movements
32 Task two: Similarity Search
- Normal database query finds exact match
- Similarity search finds data sequences that
differ only slightly from the given query
sequence
- Two categories of similarity queries
- find a sequence that is similar to the query
sequence
- find all pairs of similar sequences
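Both categories of similarity query can be sketched in Python. "Differ only slightly" is interpreted here as Euclidean distance within a threshold `epsilon` between equal-length sequences; that measure, the threshold, and the sample series are assumptions for illustration, since the slide does not fix a distance function.

```python
from itertools import combinations
from math import dist

def similar_to_query(query, sequences, epsilon):
    """Names of the stored sequences within `epsilon` of the query sequence."""
    return [name for name, s in sequences.items() if dist(query, s) <= epsilon]

def similar_pairs(sequences, epsilon):
    """All pairs of stored sequences whose distance is within `epsilon`."""
    return [
        (a, b)
        for a, b in combinations(sequences, 2)
        if dist(sequences[a], sequences[b]) <= epsilon
    ]

# Hypothetical equal-length time series.
series = {
    "s1": [1.0, 2.0, 3.0, 4.0],
    "s2": [1.1, 2.1, 2.9, 4.2],
    "s3": [4.0, 3.0, 2.0, 1.0],
}

print(similar_to_query([1.0, 2.0, 3.0, 4.0], series, epsilon=0.5))
print(similar_pairs(series, epsilon=0.5))
```

An exact-match query corresponds to `epsilon = 0`, which is why a normal database query cannot find near matches such as `s2`.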
33 Data Warehouse
34 What Is a Data Warehouse?
- Defined in many different ways, but not
rigorously.
- A decision support database that is maintained
separately from the organization's operational
database
- Support information processing by providing a
solid platform of consolidated, historical data
for analysis.
- "A data warehouse is a subject-oriented,
integrated, time-variant, and nonvolatile
collection of data in support of management's
decision-making process." (W. H. Inmon)
- Data warehousing
- The process of constructing and using data
warehouses
35 Conceptual Modeling of Data Warehouses
- Modeling data warehouses: dimensions and measures
- Star schema: A fact table in the middle connected
to a set of dimension tables
- Snowflake schema: A refinement of star schema
where some dimensional hierarchy is normalized
into a set of smaller dimension tables, forming a
shape similar to snowflake
- Fact constellations: Multiple fact tables share
dimension tables, viewed as a collection of
stars, therefore called galaxy schema or fact
constellation
36 Example of Star Schema
Sales Fact Table
- Keys: time_key, item_key, branch_key, location_key
- Measures: units_sold, dollars_sold, avg_sales
37 Example of Snowflake Schema
Sales Fact Table
- Keys: time_key, item_key, branch_key, location_key
- Measures: units_sold, dollars_sold, avg_sales
38 Example of Fact Constellation
Sales Fact Table
- Keys: time_key, item_key, branch_key, location_key
- Measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table
- Keys: time_key, item_key, shipper_key,
from_location, to_location
- Measures: dollars_cost, units_shipped
39 Multidimensional Data
- Sales volume as a function of product, month, and
region
Dimensions: Product, Location, Time
Hierarchical summarization paths:
- Product: Industry > Category > Product
- Location: Region > Country > City > Office
- Time: Year > Quarter > Month or Week > Day
[Figure: data cube with axes Region, Product, and Month]
40 Cuboids Cube
- 0-D (apex) cuboid: all
- 1-D cuboids: (product), (month), (region)
- 2-D cuboids: (product, month), (product, region),
(month, region)
- 3-D (base) cuboid: (product, month, region)
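The cuboids of a cube are exactly the subsets of its dimensions, so the lattice above can be enumerated with a short Python sketch. The function name `cuboids` is an illustrative choice.

```python
from itertools import combinations

def cuboids(dimensions):
    """Map each level k (0-D apex up to n-D base) to the cuboids with k dimensions."""
    return {k: list(combinations(dimensions, k)) for k in range(len(dimensions) + 1)}

lattice = cuboids(["product", "month", "region"])
for level, group in lattice.items():
    print(f"{level}-D:", group)

# n dimensions without hierarchies give 2**n cuboids in total.
total = sum(len(group) for group in lattice.values())
print(total)
```

The empty tuple at level 0 plays the role of the "all" (apex) cuboid, which aggregates over every dimension.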
41 OLAP Server Architectures
- Relational OLAP (ROLAP)
- Use relational or extended-relational DBMS to
store and manage warehouse data and OLAP
middleware to support missing pieces
- Include optimization of DBMS backend,
implementation of aggregation navigation logic,
and additional tools and services
- greater scalability
- Multidimensional OLAP (MOLAP)
- Array-based multidimensional storage engine
(sparse matrix techniques)
- fast indexing to pre-computed summarized data
- Hybrid OLAP (HOLAP)
- User flexibility, e.g., low level: relational,
high level: array
- Specialized SQL servers
- specialized support for SQL queries over
star/snowflake schemas
42 Data Warehouse Back-End Tools and Utilities
- Data extraction
- get data from multiple, heterogeneous, and
external sources
- Data cleaning
- detect errors in the data and rectify them when
possible
- Data transformation
- convert data from legacy or host format to
warehouse format
- Load
- sort, summarize, consolidate, compute views,
check integrity, and build indices and
partitions
- Refresh
- propagate the updates from the data sources to
the warehouse
43 Summary
- Data mining: discovering interesting patterns
from large amounts of data
- A natural evolution of database technology, in
great demand, with wide applications
- Data mining functionalities: characterization,
association, classification, clustering, mining
complex data, etc.
- Data warehousing
44 Where to Find Data Mining Papers
- Data mining and KDD (SIGKDD CD-ROM)
- Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM,
PKDD, PAKDD, etc.
- Journals: Data Mining and Knowledge Discovery, KDD
Explorations
- Database systems (SIGMOD CD-ROM)
- Conferences: ACM-SIGMOD, ACM-PODS, VLDB,
IEEE-ICDE, EDBT, ICDT, DASFAA
- Journals: ACM-TODS, IEEE-TKDE, JIIS, J. ACM, etc.
- AI and Machine Learning
- Conferences: Machine learning (ML), AAAI, IJCAI,
COLT (Learning Theory), etc.
- Journals: Machine Learning, Artificial
Intelligence, etc.
- Statistics
- Conferences: Joint Stat. Meeting, etc.
- Journals: Annals of Statistics, etc.
- Visualization
- Conference proceedings: CHI, ACM-SIGGraph, etc.
- Journals: IEEE Trans. Visualization and Computer
Graphics, etc.