1
Topics Related to Data Mining
  • CS 4/59995

2
Information Retrieval
  • Relevance Ranking Using Terms
  • Relevance Using Hyperlinks
  • Synonyms, Homonyms, and Ontologies
  • Indexing of Documents
  • Measuring Retrieval Effectiveness
  • Information Retrieval and Structured Data

3
Information Retrieval Systems
  • Information retrieval (IR) systems use a simpler
    data model than database systems
  • Information organized as a collection of
    documents
  • Documents are unstructured, no schema
  • Information retrieval locates relevant documents
    on the basis of user input such as keywords or
    example documents
  • E.g., find documents containing the words
    "database systems"
  • Can be used even on textual descriptions provided
    with non-textual data such as images

4
Keyword Search
  • In full text retrieval, all the words in each
    document are considered to be keywords.
  • We use the word "term" to refer to the words in
    a document
  • Information-retrieval systems typically allow
    query expressions formed using keywords and the
    logical connectives and, or, and not
  • "and" connectives are implicit, even if not
    explicitly specified
  • Ranking of documents on the basis of estimated
    relevance to a query is critical
  • Relevance ranking is based on factors such as
  • Term frequency
  • Frequency of occurrence of query keyword in
    document
  • Inverse document frequency
  • How many documents the query keyword occurs in
  • Fewer documents → give more importance to the
    keyword
  • Hyperlinks to documents
  • More links to a document → document is more
    important

5
Relevance Ranking Using Terms
  • TF-IDF (Term frequency/Inverse Document
    frequency) ranking
  • Let n(d) = number of terms in the document d
  • n(d, t) = number of occurrences of term t in the
    document d
  • Relevance of a document d to a term t:
      TF(d, t) = log(1 + n(d, t) / n(d))
  • The log factor is to avoid giving excessive
    weight to frequent terms
  • Relevance of document d to query Q:
      r(d, Q) = Σ_t∈Q TF(d, t) / n(t)
  • IDF = 1/n(t), where n(t) is the number of
    documents that contain the term t
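To make the ranking concrete, here is a minimal Python sketch of the TF-IDF relevance formula above (the toy corpus, document names, and helper functions are invented for illustration):

```python
import math

# Toy corpus: each document is a list of terms (invented for illustration).
docs = {
    "d1": ["database", "systems", "database", "design"],
    "d2": ["motorcycle", "repair", "manual"],
    "d3": ["database", "query", "optimization"],
}

def tf(doc, term):
    # TF(d, t) = log(1 + n(d, t) / n(d))
    return math.log(1 + doc.count(term) / len(doc))

def relevance(doc, query_terms):
    # r(d, Q) = sum over t in Q of TF(d, t) / n(t),
    # where n(t) = number of documents containing t (the IDF factor)
    score = 0.0
    for t in query_terms:
        n_t = sum(1 for d in docs.values() if t in d)
        if n_t > 0:
            score += tf(doc, t) / n_t
    return score

# Rank documents by decreasing relevance to the query "database systems".
query = ["database", "systems"]
ranked = sorted(docs, key=lambda name: relevance(docs[name], query), reverse=True)
print(ranked)  # d1 scores highest: it contains both query terms
```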
6
Relevance Ranking Using Terms (Cont.)
  • Most systems add to the above model
  • Words that occur in title, author list, section
    headings, etc. are given greater importance
  • Words whose first occurrence is late in the
    document are given lower importance
  • Very common words such as "a", "an", "the",
    "it", etc. are eliminated
  • Called stop words
  • Proximity: if keywords in a query occur close
    together in the document, the document has higher
    importance than if they occur far apart
  • Documents are returned in decreasing order of
    relevance score
  • Usually only the top few documents are returned,
    not all

7
Synonyms and Homonyms
  • Synonyms
  • E.g. document "motorcycle repair", query
    "motorcycle maintenance"
  • Need to realize that "maintenance" and "repair"
    are synonyms
  • System can extend query as "motorcycle and
    (repair or maintenance)"
  • Homonyms
  • E.g. "object" has different meanings as noun/verb
  • Can disambiguate meanings (to some extent) from
    the context
  • Extending queries automatically using synonyms
    can be problematic
  • Need to understand intended meaning in order to
    infer synonyms
  • Or verify synonyms with user
  • Synonyms may have other meanings as well

8
Indexing of Documents
  • An inverted index maps each keyword Ki to a set
    of documents Si that contain the keyword
  • Documents identified by identifiers
  • Inverted index may record
  • Keyword locations within documents, to allow
    proximity-based ranking
  • Counts of the number of occurrences of each
    keyword, to compute TF
  • "and" operation: finds documents that contain
    all of K1, K2, ..., Kn
  • Intersection: S1 ∩ S2 ∩ ... ∩ Sn
  • "or" operation: finds documents that contain at
    least one of K1, K2, ..., Kn
  • Union: S1 ∪ S2 ∪ ... ∪ Sn
  • Each Si is kept sorted to allow efficient
    intersection/union by merging
  • "not" can also be efficiently implemented by
    merging of sorted lists
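A minimal Python sketch of such an inverted index, with the "and" operation done by merging sorted posting lists (the document ids and corpus are invented for illustration):

```python
from collections import defaultdict

# Toy documents, identified by integer ids (invented for illustration).
docs = {
    1: "database systems store structured data",
    2: "information retrieval systems rank documents",
    3: "database query processing",
}

# Build the inverted index: keyword -> sorted list of document ids.
index = defaultdict(list)
for doc_id in sorted(docs):
    for term in set(docs[doc_id].split()):
        index[term].append(doc_id)

def and_query(*keywords):
    # Intersect the sorted posting lists by merging, two at a time.
    result = index[keywords[0]]
    for kw in keywords[1:]:
        posting, merged, i, j = index[kw], [], 0, 0
        while i < len(result) and j < len(posting):
            if result[i] == posting[j]:
                merged.append(result[i]); i += 1; j += 1
            elif result[i] < posting[j]:
                i += 1
            else:
                j += 1
        result = merged
    return result

print(and_query("database", "systems"))  # [1]
```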

9
Word-Level Inverted File
(figure: lexicon mapping each word to its posting
list)
10
Measuring Retrieval Effectiveness
  • Information-retrieval systems save space by using
    index structures that support only approximate
    retrieval. May result in
  • false negative (false drop) - some relevant
    documents may not be retrieved.
  • false positive - some irrelevant documents may be
    retrieved.
  • For many applications a good index should not
    permit any false drops, but may permit a few
    false positives.
  • Relevant performance metrics
  • precision - what percentage of the retrieved
    documents are relevant to the query.
  • recall - what percentage of the documents
    relevant to the query were retrieved.
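A short Python sketch computing both metrics (the retrieved and relevant document sets are invented for illustration):

```python
def precision_recall(retrieved, relevant):
    # precision = fraction of retrieved documents that are relevant
    # recall    = fraction of relevant documents that were retrieved
    hits = len(retrieved & relevant)
    return hits / len(retrieved), hits / len(relevant)

retrieved = {1, 2, 3, 4}   # documents the system returned
relevant = {2, 3, 5}       # documents actually relevant to the query
p, r = precision_recall(retrieved, relevant)
print(f"precision = {p:.2f}, recall = {r:.2f}")  # precision = 0.50, recall = 0.67
```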

11
Measuring Retrieval Effectiveness (Cont.)
  • Recall vs. precision tradeoff
  • Can increase recall by retrieving many documents
    (down to a low level of relevance ranking), but
    many irrelevant documents would be fetched,
    reducing precision
  • Measures of retrieval effectiveness
  • Recall as a function of number of documents
    fetched, or
  • Precision as a function of recall
  • Equivalently, as a function of number of
    documents fetched
  • E.g. precision of 75% at a recall of 50%, and
    60% at a recall of 75%
  • Problem: which documents are actually relevant,
    and which are not

12
Information Retrieval and Structured Data
  • Information retrieval systems originally treated
    documents as a collection of words
  • Information extraction systems infer structure
    from documents, e.g.
  • Extraction of house attributes (size, address,
    number of bedrooms, etc.) from a text
    advertisement
  • Extraction of the topic and the people named in
    a news article
  • Relations or XML structures used to store
    extracted data
  • System seeks connections among data to answer
    queries
  • Question answering systems

13
Probability and Statistics
14
Probabilities
Event E is defined as any subset of the sample
space
f(x) is called a probability density function (pdf)
15
Conditional probability of E, given that G
occurred, is
  P(E | G) = P(E ∩ G) / P(G)
E and G are independent if and only if
  P(E ∩ G) = P(E) · P(G)
Expected Value
Expected value of a discrete random variable X is
  E(X) = Σ_x x · P(X = x)
For a continuous random variable with pdf f(x),
  E(X) = ∫ x · f(x) dx
E(X + Y) = E(X) + E(Y)    E(aX + b) = a·E(X) + b
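A small Python check of the linearity rules on the last line, using two fair dice (the simulation is illustrative and not part of the original slides):

```python
import random

random.seed(0)
rolls = [(random.randint(1, 6), random.randint(1, 6)) for _ in range(100_000)]

def mean(xs):
    return sum(xs) / len(xs)

ex = mean([x for x, _ in rolls])            # E(X)    ~ 3.5
ey = mean([y for _, y in rolls])            # E(Y)    ~ 3.5
exy = mean([x + y for x, y in rolls])       # E(X+Y)  ~ 7.0 = E(X) + E(Y)
eaxb = mean([2 * x + 1 for x, _ in rolls])  # E(2X+1) ~ 8.0 = 2*E(X) + 1
print(ex, ey, exy, eaxb)
```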
16
Variance
  • Var(X) = E((X − E(X))²)
  • It indicates how values of a random variable are
    distributed around its expected value
  • Standard deviation of X is defined as
    σ = √Var(X)
  • Var(X + Y) = Var(X) + Var(Y) (for independent X
    and Y)
  • Var(aX + b) = a²·Var(X)
  • P(|S − E(S)| ≥ r) ≤ Var(S)/r²
    (Chebyshev's inequality)
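A quick empirical illustration of Chebyshev's inequality in Python (the distribution and threshold are assumptions for the example; the bound is loose but always holds):

```python
import random

random.seed(1)
xs = [random.gauss(0, 1) for _ in range(100_000)]

mu = sum(xs) / len(xs)
var = sum((x - mu) ** 2 for x in xs) / len(xs)

r = 2.0
frac = sum(1 for x in xs if abs(x - mu) >= r) / len(xs)
bound = var / r**2
print(f"P(|X - E(X)| >= {r}) = {frac:.4f} <= {bound:.4f}")  # ~0.046 <= ~0.25
```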
17
Random Distributions
Normal:
  E(X) = µ, Var(X) = σ²
Binomial (sum of n Bernoulli trials):
  E(X) = np, Var(X) = np(1 − p)
18
Normal Distributions
(figure: normal density curve centered at E(X))
19
Random Distributions
Geometric:
  E(X) = 1/p, Var(X) = (1 − p)/p²
Poisson:
  E(X) = Var(X) = µ
Uniform on [a, b]:
  f(x) = 1/(b − a)
  E(X) = (a + b)/2, Var(X) = (b − a)²/12
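A sanity check of some of these moments by simulation in Python (illustrative only):

```python
import random, statistics

random.seed(2)
n = 200_000

# Uniform on [a, b]: E(X) = (a+b)/2, Var(X) = (b-a)^2/12
a, b = 2.0, 8.0
u = [random.uniform(a, b) for _ in range(n)]
print(statistics.mean(u), (a + b) / 2)              # ~5.0 vs 5.0
print(statistics.pvariance(u), (b - a) ** 2 / 12)   # ~3.0 vs 3.0

# Geometric with success probability p: E(X) = 1/p
p = 0.25
def geometric(p):
    k = 1
    while random.random() >= p:  # count trials until the first success
        k += 1
    return k
g = [geometric(p) for _ in range(n)]
print(statistics.mean(g), 1 / p)  # ~4.0 vs 4.0
```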
20
Data and Their Characteristics
21
Types of Attributes
  • There are different types of attributes:
  • Nominal
  • Examples: ID numbers, eye color, zip codes
  • Ordinal
  • Examples: rankings (e.g., taste of potato chips
    on a scale from 1-10), grades, height in {tall,
    medium, short}
  • Interval
  • Examples: calendar dates, temperatures in Celsius
    or Fahrenheit
  • Ratio
  • Examples: temperature in Kelvin, length, time,
    counts

22
Properties of Attribute Values
  • The type of an attribute depends on which of the
    following properties it possesses:
  • Distinctness: = and ≠
  • Order: <, ≤, >, ≥
  • Addition: + and −
  • Multiplication: * and /
  • Nominal attribute: distinctness
  • Ordinal attribute: distinctness and order
  • Interval attribute: distinctness, order and
    addition
  • Ratio attribute: all 4 properties

24
Discrete and Continuous Attributes
  • Discrete Attribute
  • Has only a finite or countably infinite set of
    values
  • Examples: zip codes, counts, or the set of words
    in a collection of documents
  • Often represented as integer variables
  • Note: binary attributes are a special case of
    discrete attributes
  • Continuous Attribute
  • Has real numbers as attribute values
  • Examples: temperature, height, or weight
  • Practically, real values can only be measured and
    represented using a finite number of digits.
  • Continuous attributes are typically represented
    as floating-point variables.

25
Data Matrix
  • If data objects have the same fixed set of
    numeric attributes, then the data objects can be
    thought of as points in a multi-dimensional
    space, where each dimension represents a distinct
    attribute
  • Such a data set can be represented by an m by n
    matrix, where there are m rows, one for each
    object, and n columns, one for each attribute
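For example, a minimal sketch with NumPy (the attribute values are invented):

```python
import numpy as np

# m = 3 objects, n = 2 numeric attributes (e.g., height in m, weight in kg)
data = np.array([
    [1.75, 70.0],
    [1.62, 55.5],
    [1.80, 82.3],
])
m, n = data.shape
print(m, n)     # 3 2
print(data[0])  # first object, viewed as a point in n-dimensional space
```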

26
Data Quality
  • What kinds of data quality problems arise?
  • How can we detect problems with the data?
  • What can we do about these problems?
  • Examples of data quality problems:
  • Noise and outliers
  • Missing values
  • Duplicate data

27
Noise
  • Noise refers to modification of original values
  • Examples: distortion of a person's voice when
    talking on a poor phone connection, and "snow" on
    a television screen

(figures: two sine waves, and the same waves with
noise added)
28
Outliers
  • Outliers are data objects with characteristics
    that are considerably different from most of the
    other data objects in the data set

29
Data Preprocessing
  • Aggregation
  • Sampling
  • Dimensionality Reduction
  • Feature subset selection
  • Feature creation
  • Discretization and Binarization
  • Attribute Transformation

30
Aggregation
  • Combining two or more attributes (or objects)
    into a single attribute (or object)
  • Purpose
  • Data reduction
  • Reduce the number of attributes or objects
  • Change of scale
  • Cities aggregated into regions, states,
    countries, etc.
  • More stable data
  • Aggregated data tends to have less variability
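A small sketch of aggregation with pandas (the region/city sales data is invented for illustration):

```python
import pandas as pd

# Daily sales per city (invented data); aggregate cities up to regions.
df = pd.DataFrame({
    "region": ["East", "East", "West", "West"],
    "city":   ["Kent", "Akron", "Reno", "Fresno"],
    "sales":  [120, 95, 210, 160],
})

# Change of scale: cities aggregated into regions; also reduces row count.
by_region = df.groupby("region")["sales"].sum()
print(by_region)  # East 215, West 370
```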

31
Sampling
  • Sampling is the main technique employed for data
    selection.
  • It is often used for both the preliminary
    investigation of the data and the final data
    analysis.
  • Statisticians sample because obtaining the entire
    set of data of interest is too expensive or time
    consuming.
  • Sampling is used in data mining because
    processing the entire set of data of interest is
    too expensive or time consuming.

32
Sampling
  • The key principle for effective sampling is the
    following
  • Using a sample will work almost as well as using
    the entire data set, if the sample is
    representative
  • A sample is representative if it has
    approximately the same properties (of interest) as
    the original set of data

33
Types of Sampling
  • Simple Random Sampling
  • There is an equal probability of selecting any
    particular item
  • Sampling without replacement
  • As each item is selected, it is removed from the
    population
  • Sampling with replacement
  • Objects are not removed from the population as
    they are selected for the sample.
  • In sampling with replacement, the same object
    can be picked more than once
  • Stratified sampling
  • Split the data into several partitions; then draw
    random samples from each partition
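The three schemes in a short Python sketch (the population and strata are invented for illustration):

```python
import random

random.seed(3)
population = list(range(100))

# Simple random sampling without replacement: each item is chosen at most once.
without = random.sample(population, 10)

# Simple random sampling with replacement: the same item may appear twice.
with_repl = [random.choice(population) for _ in range(10)]

# Stratified sampling: partition the data, then sample within each partition.
strata = {"low": population[:50], "high": population[50:]}
stratified = [x for part in strata.values() for x in random.sample(part, 5)]

print(without, with_repl, stratified, sep="\n")
```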

34
Curse of Dimensionality
  • When dimensionality increases, data becomes
    increasingly sparse in the space that it occupies
  • Definitions of density and distance between
    points, which are critical for clustering and
    outlier detection, become less meaningful
  • Randomly generate 500 points
  • Compute difference between max and min distance
    between any pair of points
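A Python sketch of this experiment; as dimensionality grows, the gap between the maximum and minimum pairwise distances shrinks relative to the minimum (point counts and dimensions are illustrative):

```python
import math
import random

random.seed(4)

def relative_contrast(dim, n_points=500):
    # Generate random points in the unit hypercube and compare the
    # largest and smallest pairwise distances.
    pts = [[random.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [
        math.dist(pts[i], pts[j])
        for i in range(n_points) for j in range(i + 1, n_points)
    ]
    dmin, dmax = min(dists), max(dists)
    return math.log10((dmax - dmin) / dmin)

for dim in (2, 10, 50):
    # The contrast log10((max - min) / min) falls as dimensionality grows.
    print(dim, round(relative_contrast(dim), 2))
```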

35
Discretization Using Class Labels
  • Entropy-based approach

(figures: discretization into 3 and into 5
categories for both x and y)
36
Discretization Without Using Class Labels
(figures: the original data, and its discretization
by equal interval width, equal frequency, and
K-means)
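A brief Python sketch of the two simplest schemes above, equal-width and equal-frequency binning (illustrative; K-means discretization is omitted):

```python
import random

random.seed(5)
data = sorted(random.gauss(0, 1) for _ in range(20))
k = 4  # number of bins

# Equal interval width: split the value range into k equal-width intervals.
lo, hi = min(data), max(data)
width = (hi - lo) / k
equal_width = [min(int((x - lo) / width), k - 1) for x in data]

# Equal frequency: each bin receives (roughly) the same number of points.
per_bin = len(data) // k
equal_freq = [min(i // per_bin, k - 1) for i, x in enumerate(data)]

print(equal_width)
print(equal_freq)
```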
37
Similarity and Dissimilarity
  • Similarity
  • Numerical measure of how alike two data objects
    are.
  • Is higher when objects are more alike
  • Often falls in the range [0, 1]
  • Dissimilarity
  • Numerical measure of how different two data
    objects are
  • Lower when objects are more alike
  • Minimum dissimilarity is often 0
  • Upper limit varies
  • Proximity refers to a similarity or dissimilarity

38
Similarity/Dissimilarity for Simple Attributes
p and q are the attribute values for two data
objects.
39
Euclidean Distance
  • Euclidean Distance:
      dist(p, q) = √( Σ_k (p_k − q_k)² ), k = 1, ..., n
  • Where n is the number of dimensions
    (attributes) and p_k and q_k are, respectively,
    the kth attributes (components) of data objects p
    and q
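A minimal Python sketch of the formula, printing a pairwise distance matrix like the one on the next slide (the example points are invented):

```python
import math

# Toy 2-dimensional points (invented for illustration).
points = [(0, 2), (2, 0), (3, 1), (5, 1)]

def euclidean(p, q):
    # dist(p, q) = sqrt(sum over k of (p_k - q_k)^2)
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))

# Pairwise distance matrix: entry (i, j) is the distance between points i and j.
for p in points:
    print([round(euclidean(p, q), 2) for q in points])
```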

40
Euclidean Distance
(figure: example points and their distance matrix)
41
Minkowski Distance
  • Minkowski Distance is a generalization of
    Euclidean Distance:
      dist(p, q) = ( Σ_k |p_k − q_k|^r )^(1/r), k = 1, ..., n
  • Where r is a parameter, n is the number of
    dimensions (attributes) and p_k and q_k are,
    respectively, the kth attributes (components) of
    data objects p and q
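A sketch of the general form in Python, including the common special cases r = 1 (Manhattan), r = 2 (Euclidean), and the r → ∞ limit (supremum distance):

```python
def minkowski(p, q, r):
    # dist(p, q) = (sum over k of |p_k - q_k|^r)^(1/r)
    diffs = [abs(pk - qk) for pk, qk in zip(p, q)]
    if r == float("inf"):
        return max(diffs)  # limiting case: supremum (L_max) distance
    return sum(d ** r for d in diffs) ** (1 / r)

p, q = (0, 2), (3, 1)
print(minkowski(p, q, 1))             # Manhattan: 4.0
print(minkowski(p, q, 2))             # Euclidean: ~3.16
print(minkowski(p, q, float("inf")))  # Supremum: 3
```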

42
Minkowski Distance
(figure: example distance matrices for several
values of r)
43
Common Properties of a Distance
  • Distances, such as the Euclidean distance, have
    some well-known properties:
  • d(p, q) ≥ 0 for all p and q, and d(p, q) = 0
    only if p = q (positive definiteness)
  • d(p, q) = d(q, p) for all p and q (symmetry)
  • d(p, r) ≤ d(p, q) + d(q, r) for all points p,
    q, and r (triangle inequality)
  • where d(p, q) is the distance (dissimilarity)
    between points (data objects) p and q
  • A distance that satisfies these properties is
    called a metric

44
Common Properties of a Similarity
  • Similarities also have some well-known
    properties:
  • s(p, q) = 1 (or maximum similarity) only if
    p = q
  • s(p, q) = s(q, p) for all p and q (symmetry)
  • where s(p, q) is the similarity between points
    (data objects) p and q

45
Similarity Between Binary Vectors
  • A common situation is that objects, p and q,
    have only binary attributes
  • Compute similarities using the following
    quantities:
  • M01 = the number of attributes where p was 0 and
    q was 1
  • M10 = the number of attributes where p was 1 and
    q was 0
  • M00 = the number of attributes where p was 0 and
    q was 0
  • M11 = the number of attributes where p was 1 and
    q was 1
  • Simple Matching and Jaccard Coefficients
  • SMC = number of matches / number of attributes
    = (M11 + M00) / (M01 + M10 + M11 + M00)
  • J = number of 11 matches / number of
    not-both-zero attribute values
    = M11 / (M01 + M10 + M11)

46
SMC versus Jaccard Example
  • p = 1 0 0 0 0 0 0 0 0 0
  • q = 0 0 0 0 0 0 1 0 0 1
  • M01 = 2 (the number of attributes where p was 0
    and q was 1)
  • M10 = 1 (the number of attributes where p was 1
    and q was 0)
  • M00 = 7 (the number of attributes where p was 0
    and q was 0)
  • M11 = 0 (the number of attributes where p was 1
    and q was 1)
  • SMC = (M11 + M00) / (M01 + M10 + M11 + M00)
    = (0 + 7) / (2 + 1 + 0 + 7) = 0.7
  • J = M11 / (M01 + M10 + M11) = 0 / (2 + 1 + 0)
    = 0
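The same computation as a short Python sketch, reproducing the worked example above:

```python
def smc_jaccard(p, q):
    # Count the four attribute agreement/disagreement patterns.
    m01 = sum(1 for a, b in zip(p, q) if a == 0 and b == 1)
    m10 = sum(1 for a, b in zip(p, q) if a == 1 and b == 0)
    m00 = sum(1 for a, b in zip(p, q) if a == 0 and b == 0)
    m11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)
    smc = (m11 + m00) / (m01 + m10 + m11 + m00)
    jac = m11 / (m01 + m10 + m11)
    return smc, jac

p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(smc_jaccard(p, q))  # (0.7, 0.0)
```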