Transcript and Presenter's Notes

Title: Data Mining


1
7. Sequence Mining
  • Sequences and Strings
  • Recognition with Strings
  • MM & HMM
  • Sequence Association Rules

2
Sequences and Strings
  • A sequence x is an ordered list of discrete
    items, such as a sequence of letters or a gene
    sequence
  • Sequences and strings are often used as synonyms
  • String elements (characters, letters, or symbols)
    are nominal
  • A particularly long string of this type is called
    text
  • |x| denotes the length of sequence x
  • |AGCTTC| = 6
  • Any contiguous string that is part of x is called
    a substring, segment, or factor of x
  • GCT is a factor of AGCTTC (see the sketch below)
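
A minimal Python sketch of the length and factor notions above, using the slide's AGCTTC example (the variable name is illustrative):

x = "AGCTTC"
print(len(x))        # |AGCTTC| = 6
print("GCT" in x)    # True: GCT occurs contiguously in x, so it is a factor of x
print("GTT" in x)    # False: GTT does not occur contiguously in x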

3
Recognition with Strings
  • String matching
  • Given x and text, determine whether x is a factor
    of text
  • Edit distance (for inexact string matching)
  • Given two strings x and y, compute the minimum
    number of basic operations (character insertions,
    deletions and exchanges) needed to transform x
    into y

4
String Matching
  • Given text >> x (text much longer than x), with
    characters taken from an alphabet A
  • A can be {0, 1}, {0, 1, 2, ..., 9}, {A, G, C, T},
    or {A, B, ...}
  • A shift s is an offset needed to align the first
    character of x with character number s+1 in text
  • Find if there exists a valid shift where there is
    a perfect match between characters in x and the
    corresponding ones in text

5
Naïve (Brute-Force) String Matching
  • Given A, x, text, n = |text|, m = |x|
  • s = 0
  • while s ≤ n - m
  • if x[1..m] == text[s+1..s+m]
  • then print "pattern occurs at shift s"
  • s = s + 1
  • Time complexity (worst case): O((n-m+1)m)
  • One character shift at a time is not necessary (a
    runnable version of this brute-force matcher is
    sketched below)
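
A runnable Python version of the brute-force matcher above (0-based indexing instead of the slide's 1-based pseudocode):

def naive_match(x, text):
    """Print every valid shift s at which pattern x occurs in text.
    Worst-case time O((n - m + 1) * m), as stated on the slide."""
    n, m = len(text), len(x)
    s = 0
    while s <= n - m:
        if text[s:s + m] == x:       # compare x against the m characters at shift s
            print("pattern occurs at shift", s)
        s += 1                       # shift one character at a time

naive_match("AGC", "AGCTTAGC")       # prints shifts 0 and 5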

6
Boyer-Moore and KMP
  • See StringMatching.ppt and do not use the
    following algorithm (a simplified last-occurrence
    sketch is given below for illustration only)
  • Given A, x, text, n = |text|, m = |x|
  • F(x) = last-occurrence function
  • G(x) = good-suffix function
  • s = 0
  • while s ≤ n - m
  • j = m
  • while j > 0 and x[j] == text[s+j]
  • j = j - 1
  • if j == 0
  • then print "pattern occurs at shift s"
  • s = s + G(0)
  • else s = s + max(G(j), j - F(text[s+j]))
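
The slide defers the real algorithms to StringMatching.ppt; purely to illustrate the last-occurrence (bad-character) idea, here is a hedged Boyer-Moore-Horspool sketch. It omits the good-suffix function G(x), so it is a simplification, not the algorithm above:

def horspool_match(x, text):
    """Boyer-Moore-Horspool: shifts are driven by a last-occurrence style
    table only; the good-suffix function G(x) is deliberately omitted."""
    n, m = len(text), len(x)
    if m == 0 or m > n:
        return []
    # for each pattern character (except the last), its distance from the end of x
    shift = {c: m - 1 - i for i, c in enumerate(x[:-1])}
    hits, s = [], 0
    while s <= n - m:
        if text[s:s + m] == x:
            hits.append(s)
        # slide according to the text character aligned with the last pattern position
        s += shift.get(text[s + m - 1], m)
    return hits

print(horspool_match("GCT", "AGCTTCGCT"))   # [1, 6]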

7
Edit Distance
  • ED between x and y describes how many fundamental
    operations are required to transform x into y
  • Fundamental operations (x = excused, y = exhausted)
  • Substitutions: e.g., c is replaced by h
  • Insertions: e.g., a is inserted into x after h
  • Deletions: e.g., a character in x is deleted
  • ED is one way of measuring similarity between two
    strings (a dynamic-programming sketch follows below)
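
A dynamic-programming sketch of ED in Python (the standard Levenshtein recurrence), checked against the slide's example: ED(excused, exhausted) = 3 (substitute c with h, insert a, insert t):

def edit_distance(x, y):
    """Minimum number of substitutions, insertions and deletions
    needed to transform x into y."""
    m, n = len(x), len(y)
    D = [[0] * (n + 1) for _ in range(m + 1)]   # D[i][j] = ED(x[:i], y[:j])
    for i in range(m + 1):
        D[i][0] = i                             # delete all i characters
    for j in range(n + 1):
        D[0][j] = j                             # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,         # deletion
                          D[i][j - 1] + 1,         # insertion
                          D[i - 1][j - 1] + cost)  # substitution or match
    return D[m][n]

print(edit_distance("excused", "exhausted"))       # 3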

8
Classification using ED
  • The nearest-neighbor algorithm can be applied for
    pattern recognition
  • Training: strings with their class labels are
    stored
  • Classification (testing): a test string is compared
    to each stored string, an ED is computed, and the
    nearest stored string's label is assigned to the
    test string
  • The key is how to calculate ED
  • An example of calculating ED and using it for
    nearest-neighbor classification is sketched below
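
A minimal sketch of nearest-neighbor classification with ED. The training strings and labels are made-up placeholders (not from the slides), and edit_distance repeats the DP above in a compact row-by-row form so the example is self-contained:

def edit_distance(x, y):
    # compact row-by-row version of the DP from the previous sketch
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, 1):
        cur = [i]
        for j, cy in enumerate(y, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (cx != cy)))    # substitution or match
        prev = cur
    return prev[-1]

# hypothetical labelled training strings, for illustration only
training = [("AGCT", "gene"), ("AAAA", "repeat"), ("hello", "word")]

def classify(test):
    """Assign the label of the stored string nearest to test under ED."""
    return min(training, key=lambda pair: edit_distance(test, pair[0]))[1]

print(classify("AGTT"))   # "gene": AGCT is the nearest stored string (ED = 1)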

9
Hidden Markov Model
  • Markov Model: transitional states
  • Hidden Markov Model: additional visible states
  • Evaluation
  • Decoding
  • Learning

10
Markov Model
  • The Markov property: given the current state, the
    transition probability is independent of any
    previous states
  • A simple Markov Model
  • State ω(t) at time t
  • Sequence of length T
  • ωT = {ω(1), ω(2), ..., ω(T)}
  • Transition probability
  • P(ωj(t+1) | ωi(t)) = aij
  • It's not required that aij = aji (a small
    simulation sketch follows below)
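
A small simulation sketch of a first-order Markov model; the 3-state transition matrix aij below is made up for illustration (note the rows are not symmetric):

import random

# hypothetical transition matrix: A[i][j] = aij = P(state j at t+1 | state i at t)
A = [[0.7, 0.2, 0.1],
     [0.3, 0.4, 0.3],
     [0.2, 0.3, 0.5]]

def sample_states(A, start, T):
    """Generate a state sequence of length T; the next state depends only on
    the current one (the Markov property)."""
    states = [start]
    for _ in range(T - 1):
        current = states[-1]
        states.append(random.choices(range(len(A)), weights=A[current])[0])
    return states

print(sample_states(A, start=0, T=10))   # e.g. [0, 0, 1, 2, 2, 0, ...]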

11
Hidden Markov Model
  • Visible states
  • VT = {v(1), v(2), ..., v(T)}
  • Emitting a visible state vk(t)
  • P(vk(t) | ωj(t)) = bjk
  • Only the visible states vk(t) are accessible; the
    states ωi(t) are unobservable (an emission sketch
    follows below)
  • A Markov model is ergodic if every state has a
    nonzero probability of occurring given some
    starting state
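
Extending the previous sketch to an HMM: an emission matrix bjk generates the visible symbols, and only those symbols are observed. Both matrices are made-up illustrative numbers:

import random

A = [[0.7, 0.2, 0.1],   # hidden-state transitions aij (hypothetical)
     [0.3, 0.4, 0.3],
     [0.2, 0.3, 0.5]]
B = [[0.9, 0.1],        # emissions bjk = P(visible symbol k | hidden state j)
     [0.5, 0.5],
     [0.1, 0.9]]

def sample_hmm(A, B, start, T):
    """Return (hidden states, visible symbols); an observer sees only the symbols."""
    states, symbols = [start], []
    for t in range(T):
        j = states[-1]
        symbols.append(random.choices(range(len(B[j])), weights=B[j])[0])
        if t < T - 1:
            states.append(random.choices(range(len(A)), weights=A[j])[0])
    return states, symbols

hidden, visible = sample_hmm(A, B, start=0, T=8)
print(visible)   # what the observer sees
print(hidden)    # unobservable in practice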

12
Three Key Issues with HMM
  • Evaluation
  • Given an HMM, complete with transition
    probabilities aij and emission probabilities bjk,
    determine the probability that a particular
    sequence of visible states VT was generated by
    that model (a forward-algorithm sketch follows
    below)
  • Decoding
  • Given an HMM and a set of observations VT,
    determine the most likely sequence of hidden
    states ωT that led to VT
  • Learning
  • Given the number of states and visible states and
    a set of training observations of visible
    symbols, determine the probabilities aij and bjk.
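
For the evaluation problem, a hedged sketch of the standard forward algorithm: alpha accumulates P(observations so far, current hidden state), and summing the final alpha gives P(VT | model). The 2-state model parameters (A, B, pi) are made-up placeholders; states and symbols are integer indices:

def forward(A, B, pi, V):
    """P(visible sequence V | HMM with transitions A, emissions B, initial pi)."""
    N = len(A)                                         # number of hidden states
    alpha = [pi[j] * B[j][V[0]] for j in range(N)]     # initialization at t = 0
    for t in range(1, len(V)):
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][V[t]]
                 for j in range(N)]                    # recursion over time
    return sum(alpha)                                  # termination

# hypothetical 2-state, 2-symbol model for illustration
A = [[0.6, 0.4], [0.5, 0.5]]
B = [[0.8, 0.2], [0.3, 0.7]]
pi = [0.5, 0.5]
print(forward(A, B, pi, [0, 1, 0]))   # probability of observing 0, 1, 0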

13
Other Sequential Patterns Mining Problems
  • Sequence alignment (homology) and sequence
    assembly (genome sequencing)
  • Trend analysis
  • Trend movement vs. cyclic variations, seasonal
    variations and random fluctuations
  • Sequential pattern mining
  • Various kinds of sequences (weblogs)
  • Various methods: from GSP to PrefixSpan
  • Periodicity analysis
  • Full periodicity, partial periodicity, cyclic
    association rules

14
Periodic Pattern
  • Full periodic pattern
  • ABC ABC ABC
  • Partial periodic pattern
  • ABC ADC ACC ABC
  • Pattern hierarchy
  • ABC ABC ABC DE DE DE DE ABC ABC ABC DE DE DE DE
    ABC ABC ABC DE DE DE DE

Sequences of transactions: (ABC)3 (DE)4 as the
repeating unit (a partial-periodicity check is
sketched below)
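
A small sketch of checking a partial periodic pattern, where '*' marks a don't-care position; the representation and function name are illustrative, not from the slides. With period 3, the pattern A** matches the slide's ABC ADC ACC ABC example:

def matches_partial_period(seq, pattern):
    """True if seq follows the periodic pattern; '*' is a don't-care position."""
    p = len(pattern)
    return all(pattern[i % p] in ('*', c) for i, c in enumerate(seq))

print(matches_partial_period("ABCADCACCABC", "A**"))   # True: A recurs every 3 positions
print(matches_partial_period("ABCDBCACCABC", "A**"))   # False: the second period starts with D
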
15
Sequence Association Rule Mining
  • SPADE (Sequential Pattern Discovery using
    Equivalence classes)
  • Constrained sequence mining (SPIRIT)

16
Bibliography
  • R.O. Duda, P.E. Hart, and D.G. Stork, 2001.
    Pattern Classification. 2nd Edition. Wiley
    Interscience.
