Title: Parameter Tuning for Differential Mining of String Patterns
1Parameter Tuning for Differential Mining of String Patterns
- J. Besson, C. Rigotti, I. Mitasiunaite and J.-F. Boulicaut
2Tuning extraction parameters
- Local pattern mining: itemsets, closed itemsets, episodes, sequential patterns, substrings, ...
- under constraints (monotonic, anti-monotonic, or neither; pattern shapes, occurrence properties, measures, ...)
- can select/focus ...
- where to look in the parameter space ?
- often easy when a single threshold
- but when multiple constraints/multiple
thresholds ?
3Two different kinds of tuning
- 1) exploratory stage: find promising areas in the parameter space
- 2) fine-grained tuning: a kind of greedy strategy, by small local explorations of the parameter space
4Tools ?
- Best ever tool used in exploratory stage to find
promising setting of the parameters in local
pattern mining ???
5Tools
- GREP and Word Count
- method: manual mix
- count extracted patterns
- choose points in parameter space
- random walk
- try local greedy strategy
- having in mind known properties of the
constraints (when applicable) and domain
knowledge
6Tools
- when several parameters, several thresholds, e.g., minimal support and maximal support on another dataset
- perform a more exhaustive exploration of the pattern space
- draw curves depicting the extraction landscape
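As an illustration of what an extraction landscape contains, here is a minimal Python sketch that counts the patterns kept for each pair of thresholds. It assumes the supports of every pattern in both datasets are already known, which is exactly the costly part in practice; the function name `extraction_landscape` and the toy supports are invented for this example.

```python
def extraction_landscape(supports, min_values, max_values):
    """Count the patterns kept for every (min, max) threshold pair.

    supports: dict mapping pattern -> (support in R1, support in R2).
    Returns a dict mapping (min, max) -> number of patterns kept.
    """
    landscape = {}
    for lo in min_values:
        for hi in max_values:
            landscape[(lo, hi)] = sum(
                1 for s1, s2 in supports.values() if s1 >= lo and s2 <= hi
            )
    return landscape


# Toy supports, invented for illustration only.
toy = {"AC": (5, 1), "CG": (3, 4), "GT": (8, 0)}
grid = extraction_landscape(toy, min_values=[2, 4, 6], max_values=[0, 2, 4])
for (lo, hi), count in sorted(grid.items()):
    print(f"min={lo} max={hi} -> {count} pattern(s)")
```

Plotting the counts against one threshold, with one curve per value of the other, gives the kind of curves mentioned above.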
7Tools / landscape
8Obtaining extraction landscapes
- use a script
- can need a lot of resources to execute
- too much time needed to explore a large parameter space (several parameters)
- use a global model of the presence of the local patterns to estimate the number of patterns
- reuse/adapt a model: not so many exist
- develop a new global model: each kind of pattern and each conjunction of constraints can be a research problem in itself
- incorporate knowledge of the domain ? A global analytical model is even more complex to exhibit
9What about sampling the pattern space ?
- sounds too naive, needing complicated frameworks
- how to sample ?
- size of the sample ?
- number of patterns in the sample that satisfy the constraints ?
- using domain knowledge ?
- how to estimate the value for the whole pattern space ?
10What about simple choices ?
- sampling with replacement among the patterns that satisfy the syntactic constraints (of the conjunction of constraints)
- number of patterns in the sample that satisfy the constraints
- compute the probability to satisfy the constraints for each pattern in the sample (incorporating knowledge of the domain)
- gives the approximate number of patterns that satisfy the constraints (in the sample)
- sample size: grow the sample up to convergence of the percentage of patterns satisfying the constraints
- estimate of the number of patterns in the pattern space that satisfy the constraints = percentage × number of patterns that satisfy the syntactic constraints
11Whole process
- 1) build an initial sample of Psynt
- 2) compute an estimate of E(N) from the sample
- 3) add more patterns to the sample
- 4) compute an estimate of E(N) from the sample
- 5) if the estimate changes a lot, go to 3)
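The five steps above can be sketched in Python as follows. This is only a sketch under assumptions: `draw_pattern` and `prob_sat` are hypothetical callables supplied by the user (a sampler over Psynt with replacement, and the probability that a pattern satisfies the constraints), and the batch sizes and the relative tolerance `tol` are arbitrary choices, not values taken from the slides.

```python
import random


def estimate_until_stable(draw_pattern, prob_sat, psynt_size,
                          initial=1000, step=1000, tol=0.05,
                          max_size=1_000_000, seed=0):
    """Grow a sample of Psynt until the estimate of E(N) stabilises.

    draw_pattern(rng) -> one pattern drawn from Psynt (with replacement).
    prob_sat(p)       -> probability that pattern p satisfies the constraints.
    Returns (estimate of E(N), final sample size).
    """
    rng = random.Random(seed)
    total = 0.0   # running sum of probabilities over the sample
    n = 0         # current sample size

    def grow(k):
        nonlocal total, n
        for _ in range(k):
            total += prob_sat(draw_pattern(rng))
            n += 1

    grow(initial)                               # step 1
    estimate = total / n * psynt_size           # step 2
    while n < max_size:
        grow(step)                              # step 3
        new_estimate = total / n * psynt_size   # step 4
        changed = abs(new_estimate - estimate) > tol * max(abs(estimate), 1e-12)
        estimate = new_estimate
        if not changed:                         # step 5
            break
    return estimate, n


# Hypothetical usage: patterns of size 3 over {A, C, G, T}, constant probability.
Z = 3
draw = lambda rng: "".join(rng.choice("ACGT") for _ in range(Z))
est, size = estimate_until_stable(draw, lambda p: 0.25, psynt_size=4 ** Z)
```

The `max_size` cap is a safety net added for the sketch, so that the loop ends even if the estimate never settles.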
12Using it in freq. substring mining
- Two datasets R1 and R2 (two sets of strings)
- Constraints
- having size Z
- appearing at least min times in R1
- appearing no more than max times in R2
- Consider exact and approx. matching
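For the exact-matching case, the three constraints can be checked directly on the data. The sketch below assumes that "appearing k times" means "occurring in k strings of the dataset" (one possible reading of the support); the function names and the toy strings are invented for the example.

```python
def support(pattern, strings):
    """Number of strings that contain the pattern (exact matching)."""
    return sum(1 for s in strings if pattern in s)


def satisfies(pattern, r1, r2, size, min_sup, max_sup):
    """True if the pattern has the given size, occurs in at least min_sup
    strings of r1 and in no more than max_sup strings of r2."""
    return (len(pattern) == size
            and support(pattern, r1) >= min_sup
            and support(pattern, r2) <= max_sup)


# Toy datasets, invented for illustration only.
R1 = ["ACGT", "ACCA", "TTAC"]
R2 = ["GGGG", "ACGG"]
print(satisfies("AC", R1, R2, size=2, min_sup=2, max_sup=1))
```

Approximate matching would need a different `support` function, for instance one based on an edit or Hamming distance threshold.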
13Pattern space and K of domain
- strings over an alphabet of 4 or 8 symbols
- knowledge of the domain as three models of symbol distribution
- Me: independent symbols with equal frequencies
- Md: independent symbols with different frequencies
- Mm: first-order Markov model
- for a given p, and Me or Md or Mm, we have the probability that there exists at least one occurrence of p in a string
- from the binomial distribution we have the probability that p satisfies the min and max support constraints
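The binomial step can be sketched as below. The per-string probabilities `q1` and `q2` (that p occurs at least once in a string of R1 or R2) are taken as inputs, since computing them depends on the chosen model. The sketch further assumes that strings are independent, that all strings of a dataset share the same probability, and that R1 and R2 are independent of each other; the function names are invented for the example.

```python
from math import comb


def binom_pmf(k, n, q):
    """P(X = k) for X ~ Binomial(n, q)."""
    return comb(n, k) * q ** k * (1 - q) ** (n - k)


def prob_satisfies(q1, n1, min_sup, q2, n2, max_sup):
    """Probability that a pattern occurs in at least min_sup of the n1
    strings of R1 and in no more than max_sup of the n2 strings of R2."""
    p_min = sum(binom_pmf(k, n1, q1) for k in range(min_sup, n1 + 1))
    p_max = sum(binom_pmf(k, n2, q2) for k in range(0, min(max_sup, n2) + 1))
    return p_min * p_max


# Hypothetical values: 2 strings per dataset, q1 = q2 = 0.5.
print(prob_satisfies(0.5, 2, 1, 0.5, 2, 0))
```

This value plays the role of E(Xp) for one pattern p when estimating the number of patterns.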
14Example / random data
- 4 symb., Md (0.4, 0.1, 0.2, 0.3), 100 strings of length 1000 in R1 and R2, exact match
15Example / random data
- 4 symb. Mm, 100 strings of length 1000 in R1 and
R2, exact and approx. match
16Example / gene promoter seq.
- 4 symb. A, C, G, T
- Md, strings of 4000 symb., 29 in R1 and 21 in R2
- approx. match
17Example / gene promoter seq.
18Conclusion
- Drawing extraction landscapes for parameter tuning, in local pattern extraction, using pattern space sampling
- seems possible
- at least in some cases
- using a simple framework
- incorporating knowledge of the domain (to some extent: many works on the probability of a given pattern to satisfy constraints)
- simpler than building a global analytical model
- faster than running real extractions
- sufficient in exploratory stage ?
- companion software ?
19Example / random data
- 8 symb. Me, 100 strings of length 30000 in R1 and
R2, approx. match
20Pb - Sampling / estimate
- kind of sampling (with replacement ?)
- specific sampling (a kind of stratified sampling) for some constraints ?
- kinds of patterns ?
- quality of estimates: occurrences of different patterns are not independent
21Pb - Other parameters added
- size of starting set
- convergence criterion ? 5 ?
- size of additional subsets
- not so hard to tune ?
22Number of patterns
- conjunction of constraints C
- patterns in patt. space PS
- for each pattern p, let the variable $X_p = 1$ if p satisfies C, and $X_p = 0$ if p does not satisfy C
- N, the number of patterns that satisfy C: $N = \sum_{p \in PS} X_p$
- $E(N) = \sum_{p \in PS} E(X_p)$
- $E(X_p)$ = probability that p satisfies C
- Psynt: the patterns in PS that satisfy the syntactic constraints in C
- $E(N) = \sum_{p \in Psynt} E(X_p)$
23Number of patterns
- compute $NS = \sum E(X_p)$ over a sample of Psynt
- compute the ratio $NR = NS / \text{sample size}$
- use $NR \times |Psynt|$ as an estimate of $E(N)$
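Written out for a sample $S$ drawn from Psynt, the estimate reads as below. The numbers in the second line are hypothetical and only illustrate the arithmetic; they assume that the only syntactic constraint is the size $Z = 8$ over a 4-symbol alphabet, so that $|Psynt| = 4^8$.

```latex
\widehat{E(N)} \;=\; NR \times |Psynt|
             \;=\; \frac{|Psynt|}{|S|} \sum_{p \in S} E(X_p)

% Hypothetical numbers: Z = 8, 4 symbols, NR = 0.002 observed on the sample
|Psynt| = 4^{8} = 65536, \qquad
\widehat{E(N)} = 0.002 \times 65536 \approx 131
```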
24Example / gene promoter seq.
25Example / gene promoter seq.
26Often repeat exploratory stage
- redo the exploratory stage after important changes, such as
- data selection (e.g., part of sequences)
- encoding (e.g., mapping on event types)
- discretization (e.g., threshold of binarization)