Parameter Tuning for Differential Mining of String Patterns

1
Parameter Tuning for Differential Mining of
String Patterns
  • J. Besson, C. Rigotti, I. Mitasiunaite and J.-F.
    Boulicaut

2
Tuning extraction parameters
  • Local pattern mining: itemsets, closed itemsets,
    episodes, sequential patterns, substrings, ...
  • ... under constraints (monotonic, anti-monotonic, or
    neither; pattern shapes, occurrence properties,
    measures, ...)
  • can select/focus ...
  • where to look in the parameter space ?
  • often easy when there is a single threshold
  • but what about multiple constraints/multiple
    thresholds ?

3
Two different kinds of tuning
  • 1) exploratory stage: find promising areas in the
    parameter space
  • 2) fine-grained tuning: a kind of greedy strategy using
    small local explorations of the parameter space

4
Tools ?
  • Best ever tool used in the exploratory stage to find
    promising parameter settings in local
    pattern mining ???

5
Tools
  • grep and word count
  • method: a manual mix of
  • counting the extracted patterns
  • choosing points in the parameter space
  • random walk
  • trying a local greedy strategy
  • keeping in mind known properties of the
    constraints (when applicable) and domain
    knowledge

6
Tools
  • when there are several parameters or several thresholds,
    e.g., minimal support on one dataset and maximal support on
    another dataset
  • perform a more exhaustive exploration of the pattern
    space
  • draw curves depicting the extraction landscape
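
One possible way to tabulate such a landscape is sketched below. It is only an illustration under stated assumptions: support is taken to be the number of strings that contain a substring, matching is exact, and the helper names `supports` and `landscape` as well as the toy data are hypothetical, not taken from the presentation.

```python
from collections import Counter

def supports(strings, size):
    """Map each substring of the given size to the number of strings containing it."""
    counts = Counter()
    for s in strings:
        # Use a set so that each string counts at most once per substring.
        counts.update({s[i:i + size] for i in range(len(s) - size + 1)})
    return counts

def landscape(r1, r2, size, min_values, max_values):
    """Count substrings with support >= lo in r1 and <= hi in r2, for each (lo, hi)."""
    s1, s2 = supports(r1, size), supports(r2, size)
    grid = {}
    for lo in min_values:
        for hi in max_values:
            grid[(lo, hi)] = sum(
                1 for p, c in s1.items() if c >= lo and s2.get(p, 0) <= hi
            )
    return grid

# Toy datasets (hypothetical), substrings of size 2.
r1 = ["ACGTAC", "ACGGAC", "TTACGT"]
r2 = ["ACTTTT", "GGGGAC"]
grid = landscape(r1, r2, 2, [1, 2, 3], [0, 1, 2])
for (lo, hi), n in sorted(grid.items()):
    print(f"min={lo} max={hi} -> {n} patterns")
```

Supports are computed once and reused for every threshold pair, so the grid itself is cheap; the costly part on real data is the support computation, which is what motivates the cheaper estimates discussed in the following slides.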

7
Tools / landscape
  • Examples

8
Obtaining extraction landscapes
  • use a script - can need a lot of resources to
    execute - too much time needed to explore a large
    parameter space (several parameters)
  • use a global model of the presence of the local
    patterns to estimate the number of patterns
  • reuse/adapt a model - not many exist
  • develop a new global model - each kind of
    pattern and each conjunction of constraints can
    be a research problem in itself
  • incorporate knowledge of the domain ? A global analytical
    model is even more complex to derive

9
What about sampling the pattern space ?
  • sounds too naive, needing complicated frameworks
  • how to sample ?
  • size of the sample ?
  • number of patterns in the sample that satisfy the
    constraints ?
  • using domain knowledge ?
  • how to estimate the value for the whole pattern space
    ?

10
What about simple choices ?
  • sampling: with replacement, among the patterns that satisfy
    the syntactic constraints (from the conjunction of
    constraints)
  • number of patterns in the sample that satisfy the
    constraints:
  • compute the probability of satisfying the constraints
    for each pattern in the sample (incorporating knowledge
    of the domain)
  • this gives the approximate number of patterns that satisfy
    the constraints (in the sample)
  • sample size: grow the sample up to convergence
    of the percentage of patterns satisfying the
    constraints
  • estimate of the number of patterns in the pattern
    space that satisfy the constraints: percentage ×
    number of patterns that satisfy the syntactic constraints

11
Whole process
  • 1) build an initial sample of Psynt
  • 2) compute an estimate of E(N) from the sample
  • 3) add more patterns to the sample
  • 4) compute a new estimate of E(N) from the sample
  • 5) if the estimate changes a lot, go to 3)
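
The five steps above can be sketched as a simple loop. This is a minimal sketch, not the authors' implementation: the function name `estimate_until_stable`, the relative-change stopping rule, the default sizes, and the toy probability function are all assumptions made for illustration. The estimate is the average probability in the sample times the size of Psynt.

```python
import random

def estimate_until_stable(prob, draw, psynt_size, initial=1000, step=500,
                          tol=0.05, max_size=50000, seed=0):
    """Return (estimate of E(N), final sample size).

    prob(p): probability that pattern p satisfies the constraints (assumed given).
    draw(rng): draws one pattern from Psynt, uniformly and with replacement.
    """
    rng = random.Random(seed)
    total = sum(prob(draw(rng)) for _ in range(initial))    # 1) initial sample
    size = initial
    estimate = total / size * psynt_size                     # 2) first estimate
    while size < max_size:
        total += sum(prob(draw(rng)) for _ in range(step))   # 3) grow the sample
        size += step
        new_estimate = total / size * psynt_size             # 4) new estimate
        change = abs(new_estimate - estimate) / max(abs(estimate), 1e-12)
        estimate = new_estimate
        if change <= tol:                                    # 5) stop once stable
            break
    return estimate, size

# Toy setting (hypothetical): Psynt = all strings of length 3 over ACGT,
# and a made-up probability function, so the true E(N) is 16.
def draw(rng):
    return "".join(rng.choice("ACGT") for _ in range(3))

def prob(p):
    return 1.0 if p.startswith("A") else 0.0

estimate, size = estimate_until_stable(prob, draw, psynt_size=4 ** 3)
print(f"estimate of E(N): {estimate:.1f} from {size} sampled patterns")
```

The sketch introduces its own parameters (initial size, step, tolerance), which is exactly the kind of additional tuning the presentation raises as an open question.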

12
Using it in freq. substring mining
  • Two datasets R1 and R2 (two sets of strings)
  • Constraints:
  • having size Z
  • appearing at least min times in R1
  • appearing no more than max times in R2
  • Consider exact and approximate matching

13
Pattern space and K of domain
  • strings over an alphabet of 4 or 8 symbols
  • knowledge of the domain as three models of symbol
    distribution:
  • Me - independent symbols with equal frequencies
  • Md - independent symbols with different
    frequencies
  • Mm - first-order Markov model
  • for a given pattern p, and Me or Md or Mm, we have the
    probability that there exists at least one occurrence of p
    in a string
  • from the binomial distribution we have the probability that
    p satisfies the min and max support constraints
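
To make the last two bullets concrete, here is a hedged sketch for the Me and Md models with exact matching. It relies on simplifying assumptions that are not from the presentation: start positions are treated as independent (overlaps between occurrences are ignored), R1 and R2 are treated as independent, and support is the number of strings containing p. The function names and the mapping of the frequencies to the symbols a, b, c, d are hypothetical; the Mm model is not covered.

```python
from math import comb, prod

def occurrence_prob(pattern, freqs, length):
    """Approximate P(pattern occurs at least once in a random string).

    Covers independent-symbol models (Me, Md); treats start positions as
    independent, which is only an approximation.
    """
    p_at_position = prod(freqs[c] for c in pattern)
    positions = max(length - len(pattern) + 1, 0)
    return 1.0 - (1.0 - p_at_position) ** positions

def binom_at_least(n, q, k):
    """P(X >= k) for X ~ Binomial(n, q)."""
    return sum(comb(n, i) * q**i * (1 - q)**(n - i) for i in range(max(k, 0), n + 1))

def binom_at_most(n, q, k):
    """P(X <= k) for X ~ Binomial(n, q)."""
    return sum(comb(n, i) * q**i * (1 - q)**(n - i) for i in range(0, min(k, n) + 1))

def prob_satisfies(pattern, freqs, n1, len1, min_sup, n2, len2, max_sup):
    """P(support in R1 >= min_sup and support in R2 <= max_sup)."""
    q1 = occurrence_prob(pattern, freqs, len1)
    q2 = occurrence_prob(pattern, freqs, len2)
    return binom_at_least(n1, q1, min_sup) * binom_at_most(n2, q2, max_sup)

# Md-style frequencies; the symbol names are an assumption.
md = {"a": 0.4, "b": 0.1, "c": 0.2, "d": 0.3}
print(prob_satisfies("bbbbbb", md, 100, 1000, 1, 100, 1000, 0))
```

The value returned by `prob_satisfies` plays the role of the per-pattern probability that the sampling approach needs for each sampled pattern.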

14
Example / random data
  • 4 symbols, Md (0.4, 0.1, 0.2, 0.3), 100 strings of
    length 1000 in R1 and R2, exact match

15
Example / random data
  • 4 symbols, Mm, 100 strings of length 1000 in R1 and
    R2, exact and approximate match

16
Example / gene promoter seq.
  • 4 symbols A, C, G, T - Md, strings of 4000 symbols, 29
    in R1 and 21 in R2 - approximate match

17
Example / gene promoter seq.
  • Estimate vs. extraction

18
Conclusion
  • Drawing an extraction landscape for parameter
    tuning, in local pattern extraction, using
    pattern space sampling
  • seems possible
  • at least in some cases
  • using a simple framework
  • incorporating knowledge of the domain (to some extent -
    many works on the probability that a given pattern satisfies
    constraints)
  • simpler than building a global analytical model
  • faster than running real extractions
  • sufficient in the exploratory stage ?
  • companion software ?

19
Example / random data
  • 8 symbols, Me, 100 strings of length 30000 in R1 and
    R2, approximate match

20
Problems - Sampling / estimate
  • kind of sampling (with replacement ?)
  • specific sampling (a kind of stratified sampling) for
    some constraints ?
  • kinds of patterns ?
  • quality of estimates: occurrences of different
    patterns are not independent

21
Problems - Other parameters added
  • size of starting set
  • convergence criterion ? 5 ?
  • size of additional subsets
  • not so hard to tune ?

22
Number of patterns
  • conjunction of constraints C
  • patterns in the pattern space PS
  • for each pattern p, let the variable Xp = 1 if p satisfies C,
    or Xp = 0 if p does not satisfy C
  • N = number of patterns that satisfy C = sum of Xp over PS
  • E(N) = sum of E(Xp) over PS
  • E(Xp) = probability that p satisfies C
  • Psynt = patterns in PS that satisfy the syntactic
    constraints in C
  • E(N) = sum of E(Xp) over Psynt (since E(Xp) = 0 for any
    pattern that violates the syntactic constraints)

23
Number of patterns
  • compute NS = sum of E(Xp) over a sample of Psynt
  • compute the ratio NR = NS / sample size
  • use NR × size of Psynt as an estimate of E(N)
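
A short derivation may help to see why this estimate is reasonable. It assumes, which the slides do not state explicitly, that the sample p_1, ..., p_n is drawn uniformly at random with replacement from Psynt; E_sample denotes the expectation over the random choice of the sample, a notation introduced here only for this sketch.

```latex
% Assumption: p_1, ..., p_n are drawn uniformly, with replacement, from P_synt.
\begin{align*}
N_S &= \sum_{i=1}^{n} E(X_{p_i}), \qquad N_R = \frac{N_S}{n}, \\
E_{\text{sample}}(N_R)
  &= \frac{1}{n} \sum_{i=1}^{n} \frac{1}{|P_{synt}|} \sum_{p \in P_{synt}} E(X_p)
   = \frac{E(N)}{|P_{synt}|}, \\
E_{\text{sample}}\bigl(N_R \cdot |P_{synt}|\bigr) &= E(N).
\end{align*}
```

Under that assumption, NR × size of Psynt equals E(N) on average, and its variance shrinks as the sample grows.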

24
Example / gene promoter seq.
  • Estimate vs. extraction

25
Example / gene promoter seq.
  • Estimate vs. extraction

26
Often repeat exploratory stage
  • redo the exploratory stage after important changes
    such as
  • data selection (e.g., part of sequences)
  • encoding (e.g., mapping on event types)
  • discretization (e.g., threshold of binarization)