Title: Topics Related to Data Mining
1. Topics Related to Data Mining
2. Information Retrieval
- Relevance Ranking Using Terms
- Relevance Using Hyperlinks
- Synonyms, Homonyms, and Ontologies
- Indexing of Documents
- Measuring Retrieval Effectiveness
- Information Retrieval and Structured Data
3. Information Retrieval Systems
- Information retrieval (IR) systems use a simpler data model than database systems
  - Information is organized as a collection of documents
  - Documents are unstructured, with no schema
- Information retrieval locates relevant documents on the basis of user input such as keywords or example documents
  - E.g., find documents containing the words "database systems"
- Can be used even on textual descriptions provided with non-textual data such as images
4. Keyword Search
- In full text retrieval, all the words in each document are considered to be keywords
  - We use the word "term" to refer to the words in a document
- Information-retrieval systems typically allow query expressions formed using keywords and the logical connectives and, or, and not
  - Ands are implicit, even if not explicitly specified
- Ranking of documents on the basis of estimated relevance to a query is critical
- Relevance ranking is based on factors such as
  - Term frequency
    - Frequency of occurrence of the query keyword in the document
  - Inverse document frequency
    - How many documents the query keyword occurs in
    - Fewer documents → give more importance to the keyword
  - Hyperlinks to documents
    - More links to a document → the document is more important
5. Relevance Ranking Using Terms
- TF-IDF (Term Frequency / Inverse Document Frequency) ranking
  - Let n(d) = number of terms in the document d
  - n(d, t) = number of occurrences of term t in the document d
- Relevance of a document d to a term t:
    TF(d, t) = log(1 + n(d, t) / n(d))
  - The log factor is to avoid giving excessive weight to frequent terms
- Relevance of document d to query Q:
    r(d, Q) = Σ_{t ∈ Q} TF(d, t) / n(t)
  - IDF(t) = 1/n(t), where n(t) is the number of documents that contain the term t
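Below is a minimal Python sketch of the scoring just defined; the tokenization, toy corpus, and function names are illustrative, not part of the original slides.

```python
import math
from collections import Counter

def tf(doc_terms, t):
    # TF(d, t) = log(1 + n(d, t) / n(d))
    return math.log(1 + Counter(doc_terms)[t] / len(doc_terms))

def relevance(doc_terms, query_terms, doc_freq):
    # r(d, Q) = sum over t in Q of TF(d, t) * IDF(t), with IDF(t) = 1 / n(t)
    return sum(tf(doc_terms, t) / doc_freq[t]
               for t in query_terms if doc_freq.get(t))

# Illustrative toy corpus
docs = {
    "d1": "database systems and database design".split(),
    "d2": "operating systems concepts".split(),
}
doc_freq = Counter(t for terms in docs.values() for t in set(terms))  # n(t)
query = ["database", "systems"]
for d, terms in docs.items():
    print(d, round(relevance(terms, query, doc_freq), 4))
```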
6. Relevance Ranking Using Terms (Cont.)
- Most systems add to the above model
  - Words that occur in the title, author list, section headings, etc. are given greater importance
  - Words whose first occurrence is late in the document are given lower importance
  - Very common words such as "a", "an", "the", "it", etc. are eliminated
    - Called stop words
  - Proximity: if keywords in the query occur close together in the document, the document has higher importance than if they occur far apart
- Documents are returned in decreasing order of relevance score
  - Usually only the top few documents are returned, not all
7. Synonyms and Homonyms
- Synonyms
  - E.g., document: "motorcycle repair"; query: "motorcycle maintenance"
    - Need to realize that "maintenance" and "repair" are synonyms
  - System can extend the query as "motorcycle and (repair or maintenance)"
- Homonyms
  - E.g., "object" has different meanings as noun/verb
  - Can disambiguate meanings (to some extent) from the context
- Extending queries automatically using synonyms can be problematic
  - Need to understand the intended meaning in order to infer synonyms
    - Or verify synonyms with the user
  - Synonyms may have other meanings as well
8. Indexing of Documents
- An inverted index maps each keyword Ki to the set of documents Si that contain the keyword
  - Documents are identified by identifiers
- The inverted index may record
  - Keyword locations within the document, to allow proximity-based ranking
  - Counts of the number of occurrences of the keyword, to compute TF
- and operation: finds documents that contain all of K1, K2, ..., Kn
  - Intersection: S1 ∩ S2 ∩ ... ∩ Sn
- or operation: finds documents that contain at least one of K1, K2, ..., Kn
  - Union: S1 ∪ S2 ∪ ... ∪ Sn
- Each Si is kept sorted to allow efficient intersection/union by merging
  - not can also be efficiently implemented by merging of sorted lists
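A minimal Python sketch of an inverted index, with the and operation implemented as a merge of sorted postings lists; the toy documents are illustrative.

```python
from collections import defaultdict

def build_inverted_index(docs):
    # Map each keyword to a sorted list of identifiers of documents containing it
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for t in terms:
            index[t].add(doc_id)
    return {t: sorted(ids) for t, ids in index.items()}

def merge_and(s1, s2):
    # Intersection of two sorted postings lists by merging (the 'and' operation)
    out, i, j = [], 0, 0
    while i < len(s1) and j < len(s2):
        if s1[i] == s2[j]:
            out.append(s1[i]); i += 1; j += 1
        elif s1[i] < s2[j]:
            i += 1
        else:
            j += 1
    return out

docs = {1: ["database", "systems"], 2: ["database", "design"], 3: ["operating", "systems"]}
idx = build_inverted_index(docs)
print(merge_and(idx["database"], idx["systems"]))  # -> [1]
```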
9. Word-Level Inverted File
[Figure: a word-level inverted file, showing the lexicon mapping each term to its postings list]
10. Measuring Retrieval Effectiveness
- Information-retrieval systems save space by using index structures that support only approximate retrieval. This may result in
  - False negative (false drop): some relevant documents may not be retrieved
  - False positive: some irrelevant documents may be retrieved
- For many applications a good index should not permit any false drops, but may permit a few false positives
- Relevant performance metrics
  - Precision: what percentage of the retrieved documents are relevant to the query
  - Recall: what percentage of the documents relevant to the query were retrieved
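A small Python sketch of these two metrics, assuming the retrieved and relevant results are given as collections of document identifiers (the sample values are illustrative).

```python
def precision_recall(retrieved, relevant):
    # precision: fraction of retrieved documents that are relevant
    # recall: fraction of relevant documents that were retrieved
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

print(precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5]))  # (0.5, 0.666...)
```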
11. Measuring Retrieval Effectiveness (Cont.)
- Recall vs. precision tradeoff
  - Can increase recall by retrieving many documents (down to a low level of relevance ranking), but many irrelevant documents would be fetched, reducing precision
- Measures of retrieval effectiveness
  - Recall as a function of the number of documents fetched, or
  - Precision as a function of recall
    - Equivalently, as a function of the number of documents fetched
  - E.g., precision of 75% at a recall of 50%, and 60% at a recall of 75%
- Problem: which documents are actually relevant, and which are not
12. Information Retrieval and Structured Data
- Information retrieval systems originally treated documents as a collection of words
- Information extraction systems infer structure from documents, e.g.
  - Extraction of house attributes (size, address, number of bedrooms, etc.) from a text advertisement
  - Extraction of the topic and the people named in a news article
- Relations or XML structures are used to store the extracted data
- The system seeks connections among the data to answer queries
  - Question answering systems
13. Probabilities and Statistics
14. Probabilities
1. An event E is defined as any subset of the sample space
2. f(x) is called a probability distribution function (pdf)
15. Conditional Probabilities
- The conditional probability of E, given that G occurred, is
    P(E | G) = P(E ∩ G) / P(G)
- E and G are independent if and only if
    P(E ∩ G) = P(E) P(G)
- Expected Value
  - The expected value of X is E(X) = Σ_x x · P(X = x)
  - For a continuous pdf f(x), E(X) = ∫ x f(x) dx
  - E(X + Y) = E(X) + E(Y) and E(aX + b) = aE(X) + b
16. Variance
- Var(X) = E((X − E(X))²)
- It indicates how the values of a random variable are distributed around its expected value
- The standard deviation of X is defined as σ = √Var(X)
- Var(X + Y) = Var(X) + Var(Y)  (for independent X and Y)
- Var(aX + b) = a² Var(X)
- P(|S − E(S)| ≥ r) ≤ Var(S) / r²  (Chebyshev's Inequality)
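A quick simulation-based sanity check of these identities and of Chebyshev's inequality; the distributions and parameters are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)   # illustrative distribution
y = rng.normal(loc=1.0, scale=3.0, size=100_000)  # independent of x
a, b = 2.0, 5.0

# E(X + Y) = E(X) + E(Y) and E(aX + b) = a E(X) + b
print((x + y).mean(), x.mean() + y.mean())
print((a * x + b).mean(), a * x.mean() + b)

# For independent X, Y: Var(X + Y) = Var(X) + Var(Y); Var(aX + b) = a^2 Var(X)
print((x + y).var(), x.var() + y.var())
print((a * x + b).var(), a**2 * x.var())

# Chebyshev: P(|X - E(X)| >= r) <= Var(X) / r^2
r = 3.0
print((np.abs(x - x.mean()) >= r).mean(), x.var() / r**2)
```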
17. Random Distributions
- Normal: E(X) = µ, Var(X) = σ²
- Bernoulli (n trials, i.e., binomial): E(X) = np, Var(X) = np(1 − p)
18. Normal Distributions
[Figure: normal density curves, centered at the expected value E(X) = µ]
19. Random Distributions (Cont.)
- Geometric: E(X) = 1/p, Var(X) = (1 − p)/p²
- Poisson: E(X) = Var(X) = µ
- Uniform on [a, b]: f(x) = 1/(b − a), E(X) = (a + b)/2, Var(X) = (b − a)²/12
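A simulation-based check of these moments using NumPy; the parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
p, mu, a, b = 0.25, 4.0, 0.0, 10.0
n = 200_000

geo = rng.geometric(p, n)    # E(X) = 1/p, Var(X) = (1-p)/p^2
poi = rng.poisson(mu, n)     # E(X) = Var(X) = mu
uni = rng.uniform(a, b, n)   # E(X) = (a+b)/2, Var(X) = (b-a)^2/12

print(geo.mean(), 1 / p, geo.var(), (1 - p) / p**2)
print(poi.mean(), poi.var(), mu)
print(uni.mean(), (a + b) / 2, uni.var(), (b - a)**2 / 12)
```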
20. Data and Their Characteristics
21. Types of Attributes
- There are different types of attributes
  - Nominal
    - Examples: ID numbers, eye color, zip codes
  - Ordinal
    - Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
  - Interval
    - Examples: calendar dates, temperatures in Celsius or Fahrenheit
  - Ratio
    - Examples: temperature in Kelvin, length, time, counts
22. Properties of Attribute Values
- The type of an attribute depends on which of the following properties it possesses
  - Distinctness: = and ≠
  - Order: < and >
  - Addition: + and −
  - Multiplication: * and /
- Nominal attribute: distinctness
- Ordinal attribute: distinctness and order
- Interval attribute: distinctness, order, and addition
- Ratio attribute: all 4 properties
24. Discrete and Continuous Attributes
- Discrete Attribute
  - Has only a finite or countably infinite set of values
  - Examples: zip codes, counts, or the set of words in a collection of documents
  - Often represented as integer variables
  - Note: binary attributes are a special case of discrete attributes
- Continuous Attribute
  - Has real numbers as attribute values
  - Examples: temperature, height, or weight
  - Practically, real values can only be measured and represented using a finite number of digits
  - Continuous attributes are typically represented as floating-point variables
25. Data Matrix
- If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute
- Such a data set can be represented by an m-by-n matrix, where there are m rows, one for each object, and n columns, one for each attribute
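As a tiny illustration in NumPy (the attribute names here are assumed, not from the slides):

```python
import numpy as np

# m = 3 objects, n = 2 numeric attributes (say, height in cm and weight in kg)
data = np.array([
    [170.0, 65.0],
    [182.0, 80.0],
    [160.0, 55.0],
])
m, n = data.shape  # rows = objects, columns = attributes
print(m, n)        # 3 2
```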
26. Data Quality
- What kinds of data quality problems are there?
- How can we detect problems with the data?
- What can we do about these problems?
- Examples of data quality problems
  - Noise and outliers
  - Missing values
  - Duplicate data
27. Noise
- Noise refers to modification of original values
  - Examples: distortion of a person's voice when talking on a poor phone connection, and "snow" on a television screen
[Figure: two sine waves, and the same two sine waves with noise added]
28. Outliers
- Outliers are data objects with characteristics that are considerably different from those of most of the other data objects in the data set
29. Data Preprocessing
- Aggregation
- Sampling
- Dimensionality Reduction
- Feature subset selection
- Feature creation
- Discretization and Binarization
- Attribute Transformation
30. Aggregation
- Combining two or more attributes (or objects) into a single attribute (or object)
- Purpose
  - Data reduction
    - Reduce the number of attributes or objects
  - Change of scale
    - Cities aggregated into regions, states, countries, etc.
  - More "stable" data
    - Aggregated data tends to have less variability
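A small pandas sketch of aggregation as a change of scale; the city/state sales figures are made up for illustration.

```python
import pandas as pd

# Illustrative: aggregate city-level sales up to the state level
sales = pd.DataFrame({
    "city":  ["Austin", "Dallas", "Miami", "Tampa"],
    "state": ["TX", "TX", "FL", "FL"],
    "sales": [120, 200, 90, 110],
})
by_state = sales.groupby("state")["sales"].sum()
print(by_state)  # fewer objects, coarser scale, less variability
```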
31. Sampling
- Sampling is the main technique employed for data selection
  - It is often used for both the preliminary investigation of the data and the final data analysis
- Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming
- Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming
32. Sampling (Cont.)
- The key principle for effective sampling is the following:
  - Using a sample will work almost as well as using the entire data set, if the sample is representative
  - A sample is representative if it has approximately the same property (of interest) as the original set of data
33. Types of Sampling
- Simple random sampling
  - There is an equal probability of selecting any particular item
- Sampling without replacement
  - As each item is selected, it is removed from the population
- Sampling with replacement
  - Objects are not removed from the population as they are selected for the sample
  - In sampling with replacement, the same object can be picked more than once
- Stratified sampling
  - Split the data into several partitions, then draw random samples from each partition (see the sketch below)
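A short Python sketch of the three schemes; the population and the stratification rule are illustrative.

```python
import random
from collections import defaultdict

random.seed(0)
population = list(range(100))

# Simple random sampling without replacement
without = random.sample(population, 10)
# Simple random sampling with replacement (the same object can appear twice)
with_repl = [random.choice(population) for _ in range(10)]

# Stratified sampling: partition, then sample from each stratum
strata = defaultdict(list)
for x in population:
    strata[x % 4].append(x)  # illustrative partitioning into 4 strata
stratified = [s for part in strata.values() for s in random.sample(part, 3)]
print(without, with_repl, stratified, sep="\n")
```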
34. Curse of Dimensionality
- When dimensionality increases, data becomes increasingly sparse in the space that it occupies
- Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful
- Experiment: randomly generate 500 points, then compute the difference between the maximum and minimum distance between any pair of points (a sketch of this experiment follows below)
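A sketch of the experiment described above, assuming points drawn uniformly from the unit hypercube; as dimensionality grows, the gap between the maximum and minimum pairwise distance shrinks relative to the minimum.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for dim in (2, 10, 50, 200):
    pts = rng.random((500, dim))   # 500 random points in [0, 1]^dim
    d = pdist(pts)                 # all pairwise Euclidean distances
    print(dim, np.log10((d.max() - d.min()) / d.min()))
```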
35. Discretization Using Class Labels
[Figure: discretization results with 3 categories for both x and y, and with 5 categories for both x and y]
36. Discretization Without Using Class Labels
[Figure: the original data, and discretization by equal interval width, equal frequency, and K-means]
37. Similarity and Dissimilarity
- Similarity
  - Numerical measure of how alike two data objects are
  - Is higher when objects are more alike
  - Often falls in the range [0, 1]
- Dissimilarity
  - Numerical measure of how different two data objects are
  - Lower when objects are more alike
  - Minimum dissimilarity is often 0
  - Upper limit varies
- Proximity refers to a similarity or dissimilarity
38. Similarity/Dissimilarity for Simple Attributes
- p and q are the attribute values for two data objects
[Table: similarity and dissimilarity measures for nominal, ordinal, interval, and ratio attributes]
39. Euclidean Distance
- Euclidean distance:
    dist(p, q) = sqrt( Σ_{k=1}^{n} (p_k − q_k)² )
  - where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q
40. Euclidean Distance (Cont.)
[Figure: example points and the resulting Euclidean distance matrix]
41. Minkowski Distance
- Minkowski distance is a generalization of Euclidean distance:
    dist(p, q) = ( Σ_{k=1}^{n} |p_k − q_k|^r )^(1/r)
  - where r is a parameter, n is the number of dimensions (attributes), and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q
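A minimal implementation of this formula; r = 1 gives Manhattan distance and r = 2 gives Euclidean distance. The sample points are illustrative.

```python
def minkowski(p, q, r):
    # dist(p, q) = (sum_k |p_k - q_k|^r)^(1/r)
    return sum(abs(pk - qk) ** r for pk, qk in zip(p, q)) ** (1 / r)

p, q = (0, 2), (3, 6)
print(minkowski(p, q, 1))  # Manhattan: 7.0
print(minkowski(p, q, 2))  # Euclidean: 5.0
```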
42. Minkowski Distance (Cont.)
[Figure: example points and the resulting Minkowski distance matrices]
43. Common Properties of a Distance
- Distances, such as the Euclidean distance, have some well-known properties
  - d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q (positive definiteness)
  - d(p, q) = d(q, p) for all p and q (symmetry)
  - d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r (triangle inequality)
  - where d(p, q) is the distance (dissimilarity) between points (data objects) p and q
- A distance that satisfies these properties is called a metric
44. Common Properties of a Similarity
- Similarities also have some well-known properties
  - s(p, q) = 1 (or maximum similarity) only if p = q
  - s(p, q) = s(q, p) for all p and q (symmetry)
  - where s(p, q) is the similarity between points (data objects) p and q
45. Similarity Between Binary Vectors
- A common situation is that objects p and q have only binary attributes
- Compute similarities using the following quantities:
  - M01 = the number of attributes where p was 0 and q was 1
  - M10 = the number of attributes where p was 1 and q was 0
  - M00 = the number of attributes where p was 0 and q was 0
  - M11 = the number of attributes where p was 1 and q was 1
- Simple Matching and Jaccard Coefficients
  - SMC = number of matches / number of attributes
        = (M11 + M00) / (M01 + M10 + M11 + M00)
  - J = number of 11 matches / number of not-both-zero attribute values
      = M11 / (M01 + M10 + M11)
46. SMC versus Jaccard: Example
- p = 1 0 0 0 0 0 0 0 0 0
- q = 0 0 0 0 0 0 1 0 0 1
- M01 = 2 (the number of attributes where p was 0 and q was 1)
- M10 = 1 (the number of attributes where p was 1 and q was 0)
- M00 = 7 (the number of attributes where p was 0 and q was 0)
- M11 = 0 (the number of attributes where p was 1 and q was 1)
- SMC = (M11 + M00) / (M01 + M10 + M11 + M00) = (0 + 7) / (2 + 1 + 0 + 7) = 0.7
- J = M11 / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
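A short Python sketch that reproduces this example.

```python
def binary_similarities(p, q):
    # Count attribute agreement/disagreement patterns between binary vectors
    m01 = sum(1 for a, b in zip(p, q) if a == 0 and b == 1)
    m10 = sum(1 for a, b in zip(p, q) if a == 1 and b == 0)
    m00 = sum(1 for a, b in zip(p, q) if a == 0 and b == 0)
    m11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)
    smc = (m11 + m00) / (m01 + m10 + m11 + m00)
    jaccard = m11 / (m01 + m10 + m11) if (m01 + m10 + m11) else 0.0
    return smc, jaccard

p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(binary_similarities(p, q))  # (0.7, 0.0)
```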