Transcript and Presenter's Notes

Title: Data Mining


1
7. Sequence Mining
  • Sequences and Strings
  • Recognition with Strings
  • MM & HMM
  • Sequence Association Rules

2
Sequences and Strings
  • A sequence x is an ordered list of discrete
    items, such as a sequence of letters or a gene
    sequence
  • Sequences and strings are often used as synonyms
  • String elements (characters, letters, or symbols)
    are nominal
  • A particularly long string of this type is called
    text
  • |x| denotes the length of sequence x
  • |AGCTTC| = 6
  • Any contiguous string that is part of x is called
    a substring, segment, or factor of x
  • GCT is a factor of AGCTTC (see the sketch below)
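
A minimal Python sketch of the length and factor notions above, using the slide's AGCTTC example (the variable name is illustrative):

x = "AGCTTC"
print(len(x))        # |AGCTTC| = 6
print("GCT" in x)    # True: GCT occurs contiguously in x, so it is a factor of x
print("GTT" in x)    # False: GTT does not occur contiguously in x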

3
Recognition with Strings
  • String matching
  • Given x and text, determine whether x is a factor
    of text
  • Edit distance (for inexact string matching)
  • Given two strings x and y, compute the minimum
    number of basic operations (character insertions,
    deletions and exchanges) needed to transform x
    into y

4
String Matching
  • Given text >> x (text much longer than x), with
    characters taken from an alphabet A
  • A can be {0, 1}, {0, 1, 2, ..., 9}, {A, G, C, T},
    or {A, B, ...}
  • A shift s is an offset needed to align the first
    character of x with character number s+1 in text
  • Find if there exists a valid shift where there is
    a perfect match between characters in x and the
    corresponding ones in text

5
Naïve (Brute-Force) String Matching
  • Given A, x, text, n = |text|, m = |x|
  • s = 0
  • while s ≤ n - m
  • if x[1..m] == text[s+1..s+m]
  • then print "pattern occurs at shift s"
  • s = s + 1
  • Time complexity (worst case): O((n-m+1)m)
  • One character shift at a time is not necessary (a
    runnable version of this brute-force matcher is
    sketched below)
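
A runnable Python version of the brute-force matcher above (0-based indexing instead of the slide's 1-based pseudocode):

def naive_match(x, text):
    """Print every valid shift s at which pattern x occurs in text.
    Worst-case time O((n - m + 1) * m), as stated on the slide."""
    n, m = len(text), len(x)
    s = 0
    while s <= n - m:
        if text[s:s + m] == x:       # compare x against the m characters at shift s
            print("pattern occurs at shift", s)
        s += 1                       # shift one character at a time

naive_match("AGC", "AGCTTAGC")       # prints shifts 0 and 5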

6
Boyer-Moore and KMP
  • See StringMatching.ppt and do not use the
    following algorithm (a simplified last-occurrence
    sketch is given below for illustration only)
  • Given A, x, text, n = |text|, m = |x|
  • F(x) = last-occurrence function
  • G(x) = good-suffix function
  • s = 0
  • while s ≤ n - m
  • j = m
  • while j > 0 and x[j] == text[s+j]
  • j = j - 1
  • if j == 0
  • then print "pattern occurs at shift s"
  • s = s + G(0)
  • else s = s + max(G(j), j - F(text[s+j]))
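
The slide defers the real algorithms to StringMatching.ppt; purely to illustrate the last-occurrence (bad-character) idea, here is a hedged Boyer-Moore-Horspool sketch. It omits the good-suffix function G(x), so it is a simplification, not the algorithm above:

def horspool_match(x, text):
    """Boyer-Moore-Horspool: shifts are driven by a last-occurrence style
    table only; the good-suffix function G(x) is deliberately omitted."""
    n, m = len(text), len(x)
    if m == 0 or m > n:
        return []
    # for each pattern character (except the last), its distance from the end of x
    shift = {c: m - 1 - i for i, c in enumerate(x[:-1])}
    hits, s = [], 0
    while s <= n - m:
        if text[s:s + m] == x:
            hits.append(s)
        # slide according to the text character aligned with the last pattern position
        s += shift.get(text[s + m - 1], m)
    return hits

print(horspool_match("GCT", "AGCTTCGCT"))   # [1, 6]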

7
Edit Distance
  • ED between x and y describes how many fundamental
    operations are required to transform x into y
  • Fundamental operations (x = excused, y = exhausted)
  • Substitutions: e.g., c is replaced by h
  • Insertions: e.g., a is inserted into x after h
  • Deletions: e.g., a character in x is deleted
  • ED is one way of measuring similarity between two
    strings (a dynamic-programming sketch follows below)
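
A dynamic-programming sketch of ED in Python (the standard Levenshtein recurrence), checked against the slide's example: ED(excused, exhausted) = 3 (substitute c with h, insert a, insert t):

def edit_distance(x, y):
    """Minimum number of substitutions, insertions and deletions
    needed to transform x into y."""
    m, n = len(x), len(y)
    D = [[0] * (n + 1) for _ in range(m + 1)]   # D[i][j] = ED(x[:i], y[:j])
    for i in range(m + 1):
        D[i][0] = i                             # delete all i characters
    for j in range(n + 1):
        D[0][j] = j                             # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,         # deletion
                          D[i][j - 1] + 1,         # insertion
                          D[i - 1][j - 1] + cost)  # substitution or match
    return D[m][n]

print(edit_distance("excused", "exhausted"))       # 3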

8
Classification using ED
  • The nearest-neighbor algorithm can be applied for
    pattern recognition
  • Training: strings with their class labels are
    stored
  • Classification (testing): a test string is compared
    to each stored string, an ED is computed, and the
    nearest stored string's label is assigned to the
    test string
  • The key is how to calculate ED
  • An example of calculating ED and using it for
    nearest-neighbor classification is sketched below
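
A minimal sketch of nearest-neighbor classification with ED. The training strings and labels are made-up placeholders (not from the slides), and edit_distance repeats the DP above in a compact row-by-row form so the example is self-contained:

def edit_distance(x, y):
    # compact row-by-row version of the DP from the previous sketch
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, 1):
        cur = [i]
        for j, cy in enumerate(y, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (cx != cy)))    # substitution or match
        prev = cur
    return prev[-1]

# hypothetical labelled training strings, for illustration only
training = [("AGCT", "gene"), ("AAAA", "repeat"), ("hello", "word")]

def classify(test):
    """Assign the label of the stored string nearest to test under ED."""
    return min(training, key=lambda pair: edit_distance(test, pair[0]))[1]

print(classify("AGTT"))   # "gene": AGCT is the nearest stored string (ED = 1)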

9
Hidden Markov Model
  • Markov Model: transitional states
  • Hidden Markov Model: additional visible states
  • Evaluation
  • Decoding
  • Learning

10
Markov Model
  • The Markov property: given the current state, the
    transition probability is independent of any
    previous states
  • A simple Markov Model
  • State ω(t) at time t
  • Sequence of length T
  • ωT = {ω(1), ω(2), ..., ω(T)}
  • Transition probability
  • P(ωj(t+1) | ωi(t)) = aij
  • It's not required that aij = aji (a small
    simulation sketch follows below)
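
A small simulation sketch of a first-order Markov model; the 3-state transition matrix aij below is made up for illustration (note the rows are not symmetric):

import random

# hypothetical transition matrix: A[i][j] = aij = P(state j at t+1 | state i at t)
A = [[0.7, 0.2, 0.1],
     [0.3, 0.4, 0.3],
     [0.2, 0.3, 0.5]]

def sample_states(A, start, T):
    """Generate a state sequence of length T; the next state depends only on
    the current one (the Markov property)."""
    states = [start]
    for _ in range(T - 1):
        current = states[-1]
        states.append(random.choices(range(len(A)), weights=A[current])[0])
    return states

print(sample_states(A, start=0, T=10))   # e.g. [0, 0, 1, 2, 2, 0, ...]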

11
Hidden Markov Model
  • Visible states
  • VT = {v(1), v(2), ..., v(T)}
  • Emitting a visible state vk(t)
  • P(vk(t) | ωj(t)) = bjk
  • Only the visible states vk(t) are accessible; the
    states ωi(t) are unobservable (an emission sketch
    follows below)
  • A Markov model is ergodic if every state has a
    nonzero probability of occurring given some
    starting state
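
Extending the previous sketch to an HMM: an emission matrix bjk generates the visible symbols, and only those symbols are observed. Both matrices are made-up illustrative numbers:

import random

A = [[0.7, 0.2, 0.1],   # hidden-state transitions aij (hypothetical)
     [0.3, 0.4, 0.3],
     [0.2, 0.3, 0.5]]
B = [[0.9, 0.1],        # emissions bjk = P(visible symbol k | hidden state j)
     [0.5, 0.5],
     [0.1, 0.9]]

def sample_hmm(A, B, start, T):
    """Return (hidden states, visible symbols); an observer sees only the symbols."""
    states, symbols = [start], []
    for t in range(T):
        j = states[-1]
        symbols.append(random.choices(range(len(B[j])), weights=B[j])[0])
        if t < T - 1:
            states.append(random.choices(range(len(A)), weights=A[j])[0])
    return states, symbols

hidden, visible = sample_hmm(A, B, start=0, T=8)
print(visible)   # what the observer sees
print(hidden)    # unobservable in practice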

12
Three Key Issues with HMM
  • Evaluation
  • Given an HMM, complete with transition
    probabilities aij and emission probabilities bjk,
    determine the probability that a particular
    sequence of visible states VT was generated by
    that model (a forward-algorithm sketch follows
    below)
  • Decoding
  • Given an HMM and a set of observations VT,
    determine the most likely sequence of hidden
    states ωT that led to VT
  • Learning
  • Given the number of states and visible states and
    a set of training observations of visible
    symbols, determine the probabilities aij and bjk.
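
For the evaluation problem, a hedged sketch of the standard forward algorithm: alpha accumulates P(observations so far, current hidden state), and summing the final alpha gives P(VT | model). The 2-state model parameters (A, B, pi) are made-up placeholders; states and symbols are integer indices:

def forward(A, B, pi, V):
    """P(visible sequence V | HMM with transitions A, emissions B, initial pi)."""
    N = len(A)                                         # number of hidden states
    alpha = [pi[j] * B[j][V[0]] for j in range(N)]     # initialization at t = 0
    for t in range(1, len(V)):
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][V[t]]
                 for j in range(N)]                    # recursion over time
    return sum(alpha)                                  # termination

# hypothetical 2-state, 2-symbol model for illustration
A = [[0.6, 0.4], [0.5, 0.5]]
B = [[0.8, 0.2], [0.3, 0.7]]
pi = [0.5, 0.5]
print(forward(A, B, pi, [0, 1, 0]))   # probability of observing 0, 1, 0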

13
Other Sequential Patterns Mining Problems
  • Sequence alignment (homology) and sequence
    assembly (genome sequencing)
  • Trend analysis
  • Trend movement vs. cyclic variations, seasonal
    variations and random fluctuations
  • Sequential pattern mining
  • Various kinds of sequences (weblogs)
  • Various methods: from GSP to PrefixSpan
  • Periodicity analysis
  • Full periodicity, partial periodicity, cyclic
    association rules

14
Periodic Pattern
  • Full periodic pattern
  • ABC ABC ABC
  • Partial periodic pattern
  • ABC ADC ACC ABC
  • Pattern hierarchy
  • ABC ABC ABC DE DE DE DE ABC ABC ABC DE DE DE DE
    ABC ABC ABC DE DE DE DE

Sequences of transactions: (ABC)3 (DE)4 as the
repeating unit (a partial-periodicity check is
sketched below)
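
A small sketch of checking a partial periodic pattern, where '*' marks a don't-care position; the representation and function name are illustrative, not from the slides. With period 3, the pattern A** matches the slide's ABC ADC ACC ABC example:

def matches_partial_period(seq, pattern):
    """True if seq follows the periodic pattern; '*' is a don't-care position."""
    p = len(pattern)
    return all(pattern[i % p] in ('*', c) for i, c in enumerate(seq))

print(matches_partial_period("ABCADCACCABC", "A**"))   # True: A recurs every 3 positions
print(matches_partial_period("ABCDBCACCABC", "A**"))   # False: the second period starts with D
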
15
Sequence Association Rule Mining
  • SPADE (Sequential Pattern Discovery using
    Equivalence classes)
  • Constrained sequence mining (SPIRIT)

16
Bibliography
  • R.O. Duda, P.E. Hart, and D.G. Stork, 2001.
    Pattern Classification. 2nd Edition. Wiley
    Interscience.
