Title: CS 405G: Introduction to Database Systems
1. CS 405G: Introduction to Database Systems
2. Review: What Is Data Mining?
- Data mining (knowledge discovery from data)
  - Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
- What is classification?
  - Predicting the value of unseen data
- What is clustering?
  - Grouping similar objects together
3. Challenges of Data Mining
- Scalability
- Dimensionality
- Complex and Heterogeneous Data
- Data Quality
- Data Ownership and Distribution
- Privacy Preservation
- Streaming Data
4. Knowing the Nature of Your Data
- Data types: nominal, ordinal, interval, ratio
- Data quality
- Data preprocessing
5. What Is Data?
- A collection of data objects and their attributes
- An attribute is a property or characteristic of an object
  - Examples: eye color of a person, temperature, etc.
  - An attribute is also known as a variable, field, characteristic, or feature
- A collection of attributes describes an object
  - An object is also known as a record, point, case, sample, entity, or instance
6. Attribute Values
- Attribute values are numbers or symbols assigned to an attribute
- Distinction between attributes and attribute values
  - The same attribute can be mapped to different attribute values
    - Example: height can be measured in feet or meters
  - Different attributes can be mapped to the same set of values
    - Example: attribute values for ID and age are integers
    - But the properties of the attribute values can differ: ID has no limit, but age has a maximum and minimum value
7. Types of Attributes
- There are different types of attributes
  - Nominal
    - Examples: ID numbers, eye color, zip codes
  - Ordinal
    - Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
  - Interval
    - Examples: calendar dates, temperatures in Celsius or Fahrenheit
  - Ratio
    - Examples: temperature in Kelvin, length, time, counts
8. Properties of Attribute Values
- The type of an attribute depends on which of the following properties it possesses:
  - Distinctness: =, ≠
  - Order: <, >
  - Addition: +, -
  - Multiplication: *, /
- Nominal attribute: distinctness
- Ordinal attribute: distinctness and order
- Interval attribute: distinctness, order, and addition
- Ratio attribute: all 4 properties
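The difference between interval and ratio attributes can be checked numerically. The sketch below is illustrative only: the function name and the two temperatures are assumptions, not part of the slides. It shows that differences agree on the Celsius and Kelvin scales, while ratios are only meaningful on the Kelvin (ratio) scale.

```python
# Illustrative sketch: Celsius behaves as an interval attribute, Kelvin as a ratio attribute.

def celsius_to_kelvin(c):
    """Convert a Celsius temperature to Kelvin."""
    return c + 273.15

a_c, b_c = 10.0, 20.0                      # two made-up temperatures in Celsius
a_k, b_k = celsius_to_kelvin(a_c), celsius_to_kelvin(b_c)

# Differences (addition/subtraction) are meaningful on both scales and agree.
diff_c = b_c - a_c
diff_k = b_k - a_k

# Ratios (multiplication/division) are only meaningful on the ratio scale.
ratio_c = b_c / a_c    # 2.0, yet 20 C is not "twice as hot" as 10 C
ratio_k = b_k / a_k    # roughly 1.035, the physically meaningful ratio

print(diff_c, round(diff_k, 2), ratio_c, round(ratio_k, 3))
```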
10. Discrete and Continuous Attributes
- Discrete attribute
  - Has only a finite or countably infinite set of values
  - Examples: zip codes, counts, or the set of words in a collection of documents
  - Often represented as integer variables
  - Note: binary attributes are a special case of discrete attributes
- Continuous attribute
  - Has real numbers as attribute values
  - Examples: temperature, height, or weight
  - In practice, real values can only be measured and represented using a finite number of digits
  - Continuous attributes are typically represented as floating-point variables
11. Structured vs. Unstructured Data
- Structured data
  - Data in a relational database
- Semi-structured data
  - Graphs, trees, sequences
- Unstructured data
  - Images, text
12. Important Characteristics of Data
- Dimensionality
  - Curse of dimensionality
- Sparsity
  - Only presence counts
- Resolution
  - Patterns depend on the scale
13. Record Data
- Data that consists of a collection of records, each of which consists of a fixed set of attributes
14. Data Matrix
- If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute
- Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute
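A minimal sketch of such an m by n data matrix, stored as a list of rows. The attribute names and values are made up for illustration.

```python
# Sketch: m objects with n numeric attributes stored as an m-by-n matrix (list of rows).
# Attribute names and values are made up for illustration.
attributes = ["height_cm", "weight_kg", "age_years"]   # n = 3 columns
data_matrix = [
    [170.0, 65.0, 30.0],   # object 0
    [182.0, 80.0, 41.0],   # object 1
    [158.0, 52.0, 25.0],   # object 2
    [175.0, 72.0, 36.0],   # object 3
]

m = len(data_matrix)        # number of objects (rows)
n = len(data_matrix[0])     # number of attributes (columns)

# Every row must carry the same fixed set of attributes.
assert all(len(row) == n for row in data_matrix)

# A column collects one attribute across all objects.
heights = [row[attributes.index("height_cm")] for row in data_matrix]
print(m, n, heights)
```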
15. Document Data
- Each document becomes a "term" vector
  - Each term is a component (attribute) of the vector
  - The value of each component is the number of times the corresponding term occurs in the document
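The term-vector representation can be sketched as follows. The helper name `term_vectors`, the whitespace tokenization, and the two documents are assumptions made for this example.

```python
from collections import Counter

def term_vectors(documents):
    """Return (vocabulary, vectors): one term-count vector per document."""
    # Naive tokenization: lowercase and split on whitespace (an assumption).
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Component i holds how often vocabulary[i] occurs in this document.
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors

docs = ["the team won the game", "the coach lost the ball"]  # made-up documents
vocab, vecs = term_vectors(docs)
print(vocab)
print(vecs)
```

Most components are zero for realistic vocabularies, which is why document data is usually sparse.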
16. Transaction Data
- A special type of record data, where each record (transaction) involves a set of items
- For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.
17. Data Quality
- What kinds of data quality problems are there?
- How can we detect problems with the data?
- What can we do about these problems?
- Examples of data quality problems
  - Noise and outliers
  - Missing and duplicated data
18. Noise
- Noise refers to modification of original values
  - Examples: distortion of a person's voice when talking on a poor phone connection, and "snow" on a television screen
[Figure: two sine waves; two sine waves + noise]
19. Mapping Data to a New Space
- Fourier transform
- Wavelet transform
[Figure: two sine waves; two sine waves + noise; frequency]
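The idea behind the figure can be reproduced in a few lines: in the time domain the noisy signal looks irregular, but after a Fourier transform the two sine frequencies stand out as peaks. This is a sketch under stated assumptions: the window length, the two frequencies, the noise level, and the naive O(N^2) transform are all chosen for illustration and are not taken from the slides.

```python
import cmath
import math
import random

def dft_magnitudes(signal):
    """Magnitude of the discrete Fourier transform for bins 0 .. N/2 (naive O(N^2))."""
    n = len(signal)
    mags = []
    for k in range(n // 2 + 1):
        total = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        mags.append(abs(total))
    return mags

random.seed(0)                      # fixed seed so the run is repeatable
N = 128                             # samples in the window (assumed)
F1, F2 = 7, 17                      # cycles per window of the two sine waves (assumed)
clean = [math.sin(2 * math.pi * F1 * t / N) + math.sin(2 * math.pi * F2 * t / N)
         for t in range(N)]
noisy = [x + random.gauss(0.0, 0.5) for x in clean]   # add Gaussian noise

mags = dft_magnitudes(noisy)
# The two largest frequency bins should recover the two sine frequencies.
top_two = sorted(sorted(range(len(mags)), key=lambda k: mags[k], reverse=True)[:2])
print(top_two)
```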
20. Outliers
- Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set
- One person's outlier can be another one's treasure!
21. Missing Values
- Reasons for missing values
  - Information is not collected (e.g., people decline to give their age and weight)
  - Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
- Handling missing values
  - Eliminate data objects
  - Estimate missing values
  - Ignore the missing value during analysis
  - Replace with all possible values (weighted by their probabilities)
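The first three handling strategies can be sketched on a single attribute. The data are made up, `None` is used as the missing-value marker, and mean imputation is just one possible way to estimate a missing value.

```python
from statistics import mean

ages = [23, None, 35, 41, None, 29]   # None marks a missing value (made-up data)

# 1. Eliminate data objects that have a missing value.
eliminated = [a for a in ages if a is not None]

# 2. Estimate missing values, here with the mean of the observed values
#    (mean imputation is an assumption; other estimators are possible).
observed_mean = mean(eliminated)
estimated = [a if a is not None else observed_mean for a in ages]

# 3. Ignore the missing value during analysis: compute over observed values only.
def mean_ignoring_missing(values):
    present = [v for v in values if v is not None]
    return mean(present)

print(eliminated)
print(estimated)
print(mean_ignoring_missing(ages))
```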
22. Duplicate Data
- A data set may include data objects that are duplicates, or almost duplicates, of one another
  - A major issue when merging data from heterogeneous sources
- Examples
  - Same person with multiple email addresses
- Data cleaning
  - Process of dealing with duplicate data issues
23. EDA: Exploratory Data Analysis
- Histogram
- Box plot
- Scatter plot
- Correlation
24. Visualization Techniques: Histograms
- Histogram
  - Usually shows the distribution of values of a single variable
  - Divide the values into bins and show a bar plot of the number of objects in each bin
  - The height of each bar indicates the number of objects
  - The shape of the histogram depends on the number of bins
- Example: petal width (10 and 20 bins, respectively)
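A minimal sketch of the binning step, assuming equal-width bins and at least two distinct values. The function name and the measurements are made up; they are not the petal-width data from the slide.

```python
def histogram(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins        # assumes hi > lo
    counts = [0] * num_bins
    for v in values:
        # The maximum value is placed in the last bin instead of overflowing.
        index = min(int((v - lo) / width), num_bins - 1)
        counts[index] += 1
    return counts

widths = [0.2, 0.2, 0.4, 1.3, 1.5, 1.4, 1.0, 2.5, 1.8, 2.1, 0.3, 1.3]  # made-up values
coarse = histogram(widths, 3)
fine = histogram(widths, 6)

# Print a text bar per bin; changing the bin count changes the apparent shape.
for label, counts in (("3 bins", coarse), ("6 bins", fine)):
    print(label)
    for c in counts:
        print("  " + "#" * c)
```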
25. Two-Dimensional Histograms
- Show the joint distribution of the values of two attributes
- Example: petal width and petal length
  - What does this tell us?
26. Visualization Techniques: Box Plots
- Box plots
  - Invented by J. Tukey
  - Another way of displaying the distribution of data
  - The following figure shows the basic parts of a box plot
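Since the figure did not survive, here is a sketch of the numbers a box plot is typically drawn from. Conventions differ on where the whiskers end, so this example only reports the minimum, the three quartiles, and the maximum. The function name and data are made up.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) of a list of numbers."""
    # method="inclusive" treats the data as the whole population.
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

data = [7, 1, 3, 9, 5, 2, 8, 4, 6]   # made-up values
summary = five_number_summary(data)
print(summary)
```

The box spans Q1 to Q3 with a line at the median; computing one summary per attribute is what makes side-by-side comparison possible.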
27. Example of Box Plots
- Box plots can be used to compare attributes
28. Scatter Plot Array of Iris Attributes
29. Correlation
- Correlation measures the linear relationship between objects
- To compute correlation, we standardize the data objects, p and q, and then take their dot product
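The standardize-then-dot-product recipe can be sketched as below. One detail is an assumption of this sketch: the dot product is divided by the number of components n, with the population standard deviation used for standardization, so that the result falls in [-1, 1]. Function names are illustrative.

```python
from math import sqrt

def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    n = len(values)
    mu = sum(values) / n
    sigma = sqrt(sum((v - mu) ** 2 for v in values) / n)   # assumes sigma > 0
    return [(v - mu) / sigma for v in values]

def correlation(p, q):
    """Pearson correlation: dot product of the standardized vectors, divided by n."""
    ps, qs = standardize(p), standardize(q)
    return sum(a * b for a, b in zip(ps, qs)) / len(p)

p = [1, 2, 3, 4]
print(correlation(p, [2, 4, 6, 8]))   # perfectly linear, increasing
print(correlation(p, [8, 6, 4, 2]))   # perfectly linear, decreasing
```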
30. Visually Evaluating Correlation
Scatter plots showing the similarity from -1 to 1.
31. Discover Association Rules
32. Association Rule Mining
- Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction
[Table: market-basket transactions]
Example of association rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}
Implication means co-occurrence, not causality!
33. Definition: Frequent Itemset
- Itemset
  - A collection of one or more items
    - Example: {Milk, Bread, Diaper}
  - k-itemset
    - An itemset that contains k items
- Support count (σ)
  - Frequency of occurrence of an itemset
  - E.g., σ({Milk, Bread, Diaper}) = 2
- Support
  - Fraction of transactions that contain an itemset
  - E.g., s({Milk, Bread, Diaper}) = 2/5
- Frequent itemset
  - An itemset whose support is greater than or equal to a minsup threshold
34. Definition: Association Rule
- Association rule
  - An implication expression of the form X → Y, where X and Y are itemsets
  - Example: {Milk, Diaper} → {Beer}
- Rule evaluation metrics
  - Support (s)
    - Fraction of transactions that contain both X and Y
  - Confidence (c)
    - Measures how often items in Y appear in transactions that contain X
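Written as formulas, the two metrics are as follows. The symbol |T| for the total number of transactions is notation assumed here; σ denotes the support count of an itemset.

```latex
s(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{|T|},
\qquad
c(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{\sigma(X)}
```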
35. Mining Association Rules
Example of rules:
{Milk, Diaper} → {Beer} (s = 0.4, c = 0.67)
{Milk, Beer} → {Diaper} (s = 0.4, c = 1.0)
{Diaper, Beer} → {Milk} (s = 0.4, c = 0.67)
{Beer} → {Milk, Diaper} (s = 0.4, c = 0.67)
{Diaper} → {Milk, Beer} (s = 0.4, c = 0.5)
{Milk} → {Diaper, Beer} (s = 0.4, c = 0.5)
- Observations
  - All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
  - Rules originating from the same itemset have identical support but can have different confidence
  - Thus, we may decouple the support and confidence requirements
36. An Exercise
- The support value of pattern acm is
  - Sup(acm) = 3
- The support of pattern ac is
  - Sup(ac) = 3
- Given min_sup = 3, acm is
  - Frequent
- The confidence of the rule ac → m is
  - 100%
Transaction-id  Items bought
100             f, a, c, d, g, i, m, p
200             a, b, c, f, l, m, o
300             b, f, h, j, o
400             b, c, k, s, p
500             a, f, c, e, l, p, m, n
Transaction database TDB
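The exercise can be checked mechanically. The sketch below encodes the transaction database TDB above (item names lowercased) and computes the support counts and the confidence. The function names are illustrative, not from the slides.

```python
# Transaction database TDB from the exercise, one set of items per transaction id.
TDB = {
    100: {"f", "a", "c", "d", "g", "i", "m", "p"},
    200: {"a", "b", "c", "f", "l", "m", "o"},
    300: {"b", "f", "h", "j", "o"},
    400: {"b", "c", "k", "s", "p"},
    500: {"a", "f", "c", "e", "l", "p", "m", "n"},
}

def support_count(itemset, transactions):
    """Number of transactions that contain every item of itemset."""
    items = set(itemset)
    return sum(1 for t in transactions.values() if items <= t)

def confidence(lhs, rhs, transactions):
    """Fraction of transactions containing lhs that also contain rhs."""
    both = support_count(set(lhs) | set(rhs), transactions)
    return both / support_count(lhs, transactions)

MIN_SUP = 3
print(support_count("acm", TDB))              # Sup(acm)
print(support_count("ac", TDB))               # Sup(ac)
print(support_count("acm", TDB) >= MIN_SUP)   # is acm frequent?
print(confidence("ac", "m", TDB))             # confidence of ac -> m
```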
37. Mining Association Rules
- Two-step approach
  - Frequent itemset generation
    - Generate all itemsets whose support ≥ minsup
  - Rule generation
    - Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
- Frequent itemset generation is still computationally expensive
38. Frequent Itemset Generation
Given d items, there are 2^d possible candidate itemsets
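The exponential growth is easy to confirm by brute-force enumeration. This sketch counts every subset of a small item list, including the empty set, which is what the 2^d figure counts. The helper name is an assumption.

```python
from itertools import combinations

def all_itemsets(items):
    """Yield every subset of items, including the empty set, smallest first."""
    items = sorted(items)
    for k in range(len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)

items = ["Beer", "Bread", "Diaper", "Eggs", "Milk"]   # d = 5 example items
candidates = list(all_itemsets(items))
print(len(candidates), 2 ** len(items))
```

Each additional item doubles the number of candidates, which is why counting the support of every candidate quickly becomes infeasible.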
39. Apriori Algorithm
- A level-wise, candidate-generation-and-test approach (Agrawal & Srikant, 1994)
- Min_sup = 2

Database D
TID  Items
10   a, c, d
20   b, c, e
30   a, b, c, e
40   b, e

Scan D: 1-candidates
Itemset  Sup
a        2
b        3
c        3
d        1
e        3

Freq 1-itemsets
Itemset  Sup
a        2
b        3
c        3
e        3

2-candidates
Itemset
ab
ac
ae
bc
be
ce

Scan D: counting
Itemset  Sup
ab       1
ac       2
ae       1
bc       2
be       3
ce       2

Freq 2-itemsets
Itemset  Sup
ac       2
bc       2
be       3
ce       2

3-candidates
Itemset
bce

Scan D: Freq 3-itemsets
Itemset  Sup
bce      2
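The walk-through above can be reproduced with a short level-wise loop. This is a minimal sketch rather than the published algorithm's exact candidate-generation procedure: candidates are formed by taking unions of frequent (k-1)-itemsets and then pruning any candidate with an infrequent (k-1)-subset. It is run on database D with Min_sup = 2.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {itemset: support count} for every frequent itemset."""
    def count(candidates):
        # One scan of the database: count each candidate's occurrences.
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    items = {item for t in transactions for item in t}
    candidates = {frozenset([item]) for item in items}   # 1-candidates
    frequent = {}
    k = 1
    while candidates:
        counts = count(candidates)
        level = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(level)
        k += 1
        # Join: unions of two frequent (k-1)-itemsets that have exactly k items.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: drop candidates that have any infrequent (k-1)-subset.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent

D = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}]
result = apriori(D, min_sup=2)
for itemset, sup in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(itemset)), sup)
```

The pruning step is what keeps abc and ace from ever being counted: each contains an infrequent 2-itemset (ab and ae respectively), so only bce survives as a 3-candidate.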
40. Summary
- Nature of the data
  - Data types
    - SSN: nominal
    - Grade: ordinal
    - Temperature (degrees): interval
    - Length: ratio
  - Data quality
    - Noise
    - Outliers
    - Missing/duplicated data
41. Summary
- Common tools for exploratory data analysis
  - Histogram
  - Box plot
  - Scatter plot
  - Correlation
- Association
  - Each rule L → R has two parts: L, the left-hand itemset, and R, the right-hand itemset
  - Each rule is measured by two parameters
    - Support
    - Confidence