CS 405G: Introduction to Database Systems - PowerPoint PPT Presentation

Provided by: uky48
1
CS 405G Introduction to Database Systems
2
Review: What Is Data Mining?
  • Data mining (knowledge discovery from data)
  • Extraction of interesting (non-trivial, implicit,
    previously unknown and potentially useful)
    patterns or knowledge from huge amounts of data
  • What is classification?
  • Predicting the class label of unseen data
  • What is clustering?
  • Grouping similar objects into groups

3
Challenges of Data Mining
  • Scalability
  • Dimensionality
  • Complex and Heterogeneous Data
  • Data Quality
  • Data Ownership and Distribution
  • Privacy Preservation
  • Streaming Data

4
Knowing the Nature of Your Data
  • Data types: nominal, ordinal, interval, ratio
  • Data quality
  • Data preprocessing

5
What is Data?
  • Collection of data objects and their attributes
  • An attribute is a property or characteristic of
    an object
  • Examples: eye color of a person, temperature,
    etc.
  • An attribute is also known as a variable, field,
    characteristic, or feature
  • A collection of attributes describes an object
  • Object is also known as record, point, case,
    sample, entity, or instance

6
Attribute Values
  • Attribute values are numbers or symbols assigned
    to an attribute
  • Distinction between attributes and attribute
    values
  • Same attribute can be mapped to different
    attribute values
  • Example: height can be measured in feet or
    meters
  • Different attributes can be mapped to the same
    set of values
  • Example: attribute values for ID and age are
    integers
  • But properties of attribute values can be
    different
  • ID has no limit, but age has a maximum and a
    minimum value

7
Types of Attributes
  • There are different types of attributes
  • Nominal
  • Examples: ID numbers, eye color, zip codes
  • Ordinal
  • Examples: rankings (e.g., taste of potato chips
    on a scale from 1-10), grades, height in {tall,
    medium, short}
  • Interval
  • Examples: calendar dates, temperatures in Celsius
    or Fahrenheit
  • Ratio
  • Examples: temperature in Kelvin, length, time,
    counts

8
Properties of Attribute Values
  • The type of an attribute depends on which of the
    following properties it possesses
  • Distinctness: = ≠
  • Order: < >
  • Addition: + -
  • Multiplication: * /
  • Nominal attribute: distinctness
  • Ordinal attribute: distinctness and order
  • Interval attribute: distinctness, order, and
    addition
  • Ratio attribute: all 4 properties

9
Properties of Attribute Values
10
Discrete and Continuous Attributes
  • Discrete Attribute
  • Has only a finite or countably infinite set of
    values
  • Examples: zip codes, counts, or the set of words
    in a collection of documents
  • Often represented as integer variables
  • Note: binary attributes are a special case of
    discrete attributes
  • Continuous Attribute
  • Has real numbers as attribute values
  • Examples: temperature, height, or weight
  • Practically, real values can only be measured and
    represented using a finite number of digits.
  • Continuous attributes are typically represented
    as floating-point variables.

11
Structured vs Unstructured Data
  • Structured Data
  • Data in a relational database
  • Semi-structured data
  • Graphs, trees, sequences
  • Unstructured data
  • Image, text

12
Important Characteristics of Data
  • Dimensionality
  • Curse of Dimensionality
  • Sparsity
  • Only presence counts
  • Resolution
  • Patterns depend on the scale

13
Record Data
  • Data that consists of a collection of records,
    each of which consists of a fixed set of
    attributes

14
Data Matrix
  • If data objects have the same fixed set of
    numeric attributes, then the data objects can be
    thought of as points in a multi-dimensional
    space, where each dimension represents a distinct
    attribute
  • Such a data set can be represented by an m by n
    matrix, where there are m rows, one for each
    object, and n columns, one for each attribute
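
As a small illustration of the m by n layout, the sketch below stores three objects with two numeric attributes each. The attribute names and values are hypothetical and only serve to show rows as objects and columns as attributes.

```python
# Hypothetical data: 3 objects (rows) x 2 numeric attributes (columns).
attributes = ["height_m", "weight_kg"]
data_matrix = [
    [1.70, 65.0],
    [1.82, 80.5],
    [1.65, 58.2],
]

m = len(data_matrix)      # number of objects (rows)
n = len(data_matrix[0])   # number of attributes (columns)

# Each row is a point in n-dimensional space; each column is one attribute.
column = attributes.index("weight_kg")
weights = [row[column] for row in data_matrix]
print(m, n, weights)
```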

15
Document Data
  • Each document becomes a 'term' vector,
  • each term is a component (attribute) of the
    vector,
  • the value of each component is the number of
    times the corresponding term occurs in the
    document.
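
A term vector can be sketched in a few lines. The vocabulary, the sample sentence, and the helper name `term_vector` are assumptions made for illustration, and splitting on whitespace is a deliberate simplification of real tokenization.

```python
from collections import Counter

def term_vector(document, vocabulary):
    """Return term counts for `document`, ordered by `vocabulary`."""
    counts = Counter(document.lower().split())
    # A Counter returns 0 for terms that never occur in the document.
    return [counts[term] for term in vocabulary]

# Hypothetical vocabulary and document.
vocabulary = ["data", "mining", "database"]
doc = "Data mining finds patterns in data"
print(term_vector(doc, vocabulary))
```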

16
Transaction Data
  • A special type of record data, where
  • each record (transaction) involves a set of
    items.
  • For example, consider a grocery store. The set
    of products purchased by a customer during one
    shopping trip constitutes a transaction, while
    the individual products that were purchased are
    the items.

17
Data Quality
  • What kinds of data quality problems?
  • How can we detect problems with the data?
  • What can we do about these problems?
  • Examples of data quality problems
  • Noise and outliers
  • Missing and duplicated data

18
Noise
  • Noise refers to modification of original values
  • Examples: distortion of a person's voice when
    talking on a poor phone connection, and "snow"
    on a television screen

[Figure panels: Two Sine Waves; Two Sine Waves + Noise]
19
Mapping Data to a New Space
  • Fourier transform
  • Wavelet transform

[Figure panels: Two Sine Waves; Two Sine Waves + Noise; Frequency]
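
To make the idea of mapping noisy data to a frequency space concrete, here is a minimal sketch using a naive discrete Fourier transform. The signal length, the two frequencies (5 and 12 cycles per window), and the noise level are assumptions chosen for illustration, not values from the slide's figure.

```python
import cmath
import math
import random

def dft_magnitudes(signal):
    """Naive O(N^2) discrete Fourier transform; returns |X[k]| for each bin k."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(signal)))
        for k in range(n)
    ]

random.seed(0)  # fixed seed so the illustration is repeatable
N = 64
signal = [
    math.sin(2 * math.pi * 5 * t / N)      # sine with 5 cycles per window
    + math.sin(2 * math.pi * 12 * t / N)   # sine with 12 cycles per window
    + random.uniform(-0.2, 0.2)            # additive noise
    for t in range(N)
]

magnitudes = dft_magnitudes(signal)
# Only bins below N/2 are distinct for a real-valued signal.
half = range(1, N // 2)
strongest = sorted(half, key=lambda k: magnitudes[k], reverse=True)[:2]
dominant = sorted(strongest)
print(dominant)
```

In the time domain the noise blurs the two waves together, while in the frequency domain the two underlying frequencies stand out as the largest bins.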
20
Outliers
  • Outliers are data objects with characteristics
    that are considerably different than most of the
    other data objects in the data set
  • One person's outlier can be another one's
    treasure!

21
Missing Values
  • Reasons for missing values
  • Information is not collected (e.g., people
    decline to give their age and weight)
  • Attributes may not be applicable to all cases
    (e.g., annual income is not applicable to
    children)
  • Handling missing values
  • Eliminate Data Objects
  • Estimate Missing Values
  • Ignore the Missing Value During Analysis
  • Replace with all possible values (weighted by
    their probabilities)
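
The first three handling strategies can be sketched as follows. The ages are hypothetical, `None` marks a value that was not collected, and mean imputation is only one of several possible ways to estimate a missing value.

```python
from statistics import mean

# Hypothetical ages; None marks a value that was not collected.
ages = [23, None, 35, 41, None, 29]

# Option 1: eliminate data objects that have a missing value.
complete = [a for a in ages if a is not None]

# Option 2: estimate missing values, here with the mean of the observed ones.
fill = mean(complete)
imputed = [a if a is not None else fill for a in ages]

# Option 3: ignore missing values during analysis (skip them in the statistic).
observed_mean = mean(a for a in ages if a is not None)

print(complete, imputed, observed_mean)
```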

22
Duplicate Data
  • Data set may include data objects that are
    duplicates, or almost duplicates of one another
  • Major issue when merging data from heterogeneous
    sources
  • Examples
  • Same person with multiple email addresses
  • Data cleaning
  • Process of dealing with duplicate data issues

23
EDA: Exploratory Data Analysis
  • Histogram
  • Box plot
  • Scatter plot
  • Correlation

24
Visualization Techniques: Histograms
  • Histogram
  • Usually shows the distribution of values of a
    single variable
  • Divide the values into bins and show a bar plot
    of the number of objects in each bin.
  • The height of each bar indicates the number of
    objects
  • Shape of histogram depends on the number of bins
  • Example: petal width (10 and 20 bins,
    respectively)
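
The binning step can be sketched as below, assuming equal-width bins over the observed range. The values are hypothetical (they are not the Iris petal widths from the slide), and the helper assumes the values are not all identical.

```python
def histogram(values, num_bins):
    """Count how many values fall into each of `num_bins` equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # The maximum value is placed in the last bin rather than a new one.
        index = min(int((v - lo) / width), num_bins - 1)
        counts[index] += 1
    return counts

# Hypothetical measurements.
values = [0.2, 0.2, 0.4, 1.0, 1.3, 1.5, 1.8, 2.1, 2.3, 2.5]
for bins in (2, 5):
    print(bins, histogram(values, bins))
```

Running it with two different bin counts shows how the apparent shape of the distribution changes with the number of bins.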

25
Two-Dimensional Histograms
  • Show the joint distribution of the values of two
    attributes
  • Example: petal width and petal length
  • What does this tell us?

26
Visualization Techniques: Box Plots
  • Box Plots
  • Invented by J. Tukey
  • Another way of displaying the distribution of
    data
  • The following figure shows the basic parts of a
    box plot
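
The figure is not reproduced in this transcript. As a rough sketch, one common convention draws the box from the first to the third quartile with a line at the median; which values the whiskers mark varies between conventions, so the minimum and maximum used here are an assumption. The sample data and the helper name `five_number_summary` are hypothetical.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, median, q3 = quantiles(values, n=4)  # three quartile cut points
    return min(values), q1, median, q3, max(values)

# Hypothetical sample.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]
low, q1, median, q3, high = five_number_summary(sample)
iqr = q3 - q1  # interquartile range: the height of the box
print(low, q1, median, q3, high, iqr)
```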

27
Example of Box Plots
  • Box plots can be used to compare attributes

28
Scatter Plot Array of Iris Attributes
29
Correlation
  • Correlation measures the linear relationship
    between objects
  • To compute correlation, we standardize data
    objects, p and q, and then take their dot product
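
A minimal sketch of that computation, assuming the sample standard deviation is used for standardization, in which case the dot product is divided by n - 1 to obtain the Pearson correlation. The vectors are hypothetical.

```python
from statistics import mean, stdev

def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]

def correlation(p, q):
    """Pearson correlation: dot product of standardized p and q over (n - 1)."""
    ps, qs = standardize(p), standardize(q)
    return sum(a * b for a, b in zip(ps, qs)) / (len(p) - 1)

# Hypothetical data objects.
p = [1.0, 2.0, 3.0, 4.0]
print(correlation(p, [2.0, 4.0, 6.0, 8.0]))  # perfectly linear, increasing
print(correlation(p, [8.0, 6.0, 4.0, 2.0]))  # perfectly linear, decreasing
```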

30
Visually Evaluating Correlation
Scatter plots showing the similarity from -1 to 1.
31
Discover Association Rules
  • Apriori Algorithm

32
Association Rule Mining
  • Given a set of transactions, find rules that will
    predict the occurrence of an item based on the
    occurrences of other items in the transaction

Market-basket transactions
Example of association rules:
  {Diaper} → {Beer}
  {Milk, Bread} → {Eggs, Coke}
  {Beer, Bread} → {Milk}
Implication means co-occurrence, not causality!
33
Definition Frequent Itemset
  • Itemset
  • A collection of one or more items
  • Example: {Milk, Bread, Diaper}
  • k-itemset
  • An itemset that contains k items
  • Support count (σ)
  • Frequency of occurrence of an itemset
  • E.g. σ({Milk, Bread, Diaper}) = 2
  • Support
  • Fraction of transactions that contain an itemset
  • E.g. s({Milk, Bread, Diaper}) = 2/5
  • Frequent Itemset
  • An itemset whose support is greater than or equal
    to a minsup threshold
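
These definitions translate directly into code. The slide's transaction table did not survive in this transcript, so the five baskets below are an assumed stand-in, chosen so that the support count and support of {Milk, Bread, Diaper} match the values quoted above. The function names are hypothetical.

```python
# Assumed market-basket data (the original table is not in the transcript).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)

def support(itemset, transactions):
    """Fraction of transactions that contain `itemset`."""
    return support_count(itemset, transactions) / len(transactions)

def is_frequent(itemset, transactions, minsup):
    """An itemset is frequent if its support is at least `minsup`."""
    return support(itemset, transactions) >= minsup

itemset = {"Milk", "Bread", "Diaper"}
print(support_count(itemset, transactions), support(itemset, transactions))
```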

34
Definition Association Rule
  • Association Rule
  • An implication expression of the form X → Y,
    where X and Y are itemsets
  • Example: {Milk, Diaper} → {Beer}
  • Rule Evaluation Metrics
  • Support (s)
  • Fraction of transactions that contain both X and
    Y
  • Confidence (c)
  • Measures how often items in Y appear in
    transactions that contain X
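
In symbols, writing σ(·) for the support count and |T| for the number of transactions (|T| is notation introduced here, not taken from the slide), the two metrics are commonly defined as:

```latex
s(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{|T|},
\qquad
c(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{\sigma(X)}
```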

35
Mining Association Rules
Example of rules:
  {Milk, Diaper} → {Beer} (s = 0.4, c = 0.67)
  {Milk, Beer} → {Diaper} (s = 0.4, c = 1.0)
  {Diaper, Beer} → {Milk} (s = 0.4, c = 0.67)
  {Beer} → {Milk, Diaper} (s = 0.4, c = 0.67)
  {Diaper} → {Milk, Beer} (s = 0.4, c = 0.5)
  {Milk} → {Diaper, Beer} (s = 0.4, c = 0.5)
  • Observations
  • All the above rules are binary partitions of the
    same itemset {Milk, Diaper, Beer}
  • Rules originating from the same itemset have
    identical support but can have different
    confidence
  • Thus, we may decouple the support and confidence
    requirements
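
The observation that all binary partitions of one itemset share a support value but differ in confidence can be checked with a short sketch. The five baskets are an assumed data set (the slide's table is not in this transcript) that is consistent with the s and c values listed above. The helper names are hypothetical.

```python
from itertools import combinations

# Assumed market-basket data (the original table is not in the transcript).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def count(itemset):
    """Support count of `itemset` in the assumed transactions."""
    return sum(1 for t in transactions if itemset <= t)

def rules_from(itemset):
    """Yield (X, Y, support, confidence) for every binary partition X -> Y.

    Assumes every left-hand side occurs in at least one transaction.
    """
    items = frozenset(itemset)
    s = count(items) / len(transactions)  # identical for every partition
    for size in range(1, len(items)):
        for lhs in combinations(sorted(items), size):
            x = frozenset(lhs)
            yield x, items - x, s, count(items) / count(x)

for x, y, s, c in rules_from({"Milk", "Diaper", "Beer"}):
    print(sorted(x), "->", sorted(y), f"s={s:.2f}", f"c={c:.2f}")
```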

36
An Exercise
  • The support value of pattern acm is
  • Sup(acm) = 3
  • The support of pattern ac is
  • Sup(ac) = 3
  • Given min_sup = 3, acm is
  • Frequent
  • The confidence of the rule ac → m is
  • 100%

Transaction database TDB
Transaction-id   Items bought
100              f, a, c, d, g, i, m, p
200              a, b, c, f, l, m, o
300              b, f, h, j, o
400              b, c, k, s, p
500              a, f, c, e, l, p, m, n
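
The exercise can be verified mechanically. The sketch below encodes the five transactions of TDB as sets of single-letter items (the item in transaction 100 that may appear as a capital "I" is treated as "i") and recomputes the support counts and the confidence. The helper name `sup` mirrors the slide's notation.

```python
# Transaction database TDB from the exercise, one set of items per transaction.
tdb = {
    100: set("facdgimp"),
    200: set("abcflmo"),
    300: set("bfhjo"),
    400: set("bcksp"),
    500: set("afcelpmn"),
}

def sup(pattern):
    """Support count: transactions that contain every item in `pattern`."""
    return sum(1 for items in tdb.values() if set(pattern) <= items)

min_sup = 3
print(sup("acm"), sup("ac"))     # support counts of acm and ac
print(sup("acm") >= min_sup)     # is acm frequent?
print(sup("acm") / sup("ac"))    # confidence of the rule ac -> m
```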
37
Mining Association Rules
  • Two-step approach
  • Frequent Itemset Generation
  • Generate all itemsets whose support ≥ minsup
  • Rule Generation
  • Generate high confidence rules from each frequent
    itemset, where each rule is a binary partitioning
    of a frequent itemset
  • Frequent itemset generation is still
    computationally expensive

38
Frequent Itemset Generation
Given d items, there are 2^d possible candidate
itemsets
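
The exponential growth is easy to confirm by brute-force enumeration. In this sketch the five item names are arbitrary, and note that the count of 2^d includes the empty itemset, so there are 2^d - 1 non-empty candidates.

```python
from itertools import combinations

def all_itemsets(items):
    """Every subset of `items`, including the empty set: 2**d in total."""
    items = sorted(items)
    return [
        frozenset(combo)
        for k in range(len(items) + 1)
        for combo in combinations(items, k)
    ]

# Hypothetical item universe with d = 5 items.
items = {"a", "b", "c", "d", "e"}
candidates = all_itemsets(items)
print(len(candidates), 2 ** len(items))
```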
39
Apriori Algorithm
  • A level-wise, candidate-generation-and-test
    approach (Agrawal & Srikant, 1994)

Database D (Min_sup = 2)
  TID  Items
  10   a, c, d
  20   b, c, e
  30   a, b, c, e
  40   b, e
Scan D → 1-candidates (itemset: sup)
  a: 2, b: 3, c: 3, d: 1, e: 3
Freq 1-itemsets
  a: 2, b: 3, c: 3, e: 3
2-candidates
  ab, ac, ae, bc, be, ce
Scan D → counting
  ab: 1, ac: 2, ae: 1, bc: 2, be: 3, ce: 2
Freq 2-itemsets
  ac: 2, bc: 2, be: 3, ce: 2
3-candidates
  bce
Scan D → Freq 3-itemsets
  bce: 2
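
The level-wise walk-through above can be reproduced with a compact sketch. This is a simplified illustration rather than the published algorithm: candidates are formed by taking unions of frequent k-itemsets and then pruning any candidate with an infrequent k-subset, which gives the same candidates on this example. The function name `apriori` and its signature are assumptions.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {itemset: support count} for all itemsets with count >= min_sup."""
    transactions = [frozenset(t) for t in transactions]
    # 1-candidates: every individual item.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Scan the database once to count each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates, then prune any
        # candidate that has an infrequent k-subset.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent

# Database D from the slide; each string is one transaction's items.
database = {10: "acd", 20: "bce", 30: "abce", 40: "be"}
result = apriori(database.values(), min_sup=2)
ordered = sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0])))
for itemset, count in ordered:
    print("".join(sorted(itemset)), count)
```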
40
Summary
  • Nature of the data
  • Data types
  • SSN: nominal
  • Grade: ordinal
  • Temperature (degrees): interval
  • Length: ratio
  • Data Quality
  • Noise
  • Outlier
  • Missing/duplicated data

41
Summary
  • Common tools for exploratory data analysis
  • Histogram
  • Box plot
  • Scatter plot
  • Correlation
  • Association
  • Each rule L → R has two parts: L, the left-hand
    itemset, and R, the right-hand itemset
  • Each rule is measured by two parameters
  • Support
  • Confidence