Dynamic Database User Session Identification with Statistical Language Model

1
Dynamic Database User Session Identification with
Statistical Language Model
  • By Qingsong Yao, Xiangji Huang and Aijun An
  • qingsong, aan@cs.yorku.ca, jhuang@yorku.ca
  • York University
  • Toronto, Canada

2
Outline
  • Introduction
  • Statistical Language Model (N-gram)
  • Session Identification with N-gram Model
  • Data Description
  • Experimental Results
  • Conclusion

3
Introduction
  • Database users typically submit a sequence of
    queries to accomplish a certain task; such a
    sequence is called a database user session.
  • Separating database sessions from a database
    trace or workload is a critical task.
  • The traditional session separation method is the
    timeout method, which is based on the assumption
    that the think-time between two consecutive user
    sessions is longer than the think-time between
    two consecutive events within a session. This
    assumption is not always true.
  • In this presentation, we illustrate the idea of
    using a statistical language model, referred to
    as an N-gram model, to separate sessions. Our
    method is based on statistical information
    regarding the requests of database users.

4
Statistical Language Model (N-gram)
  • What is the probability of a string W? By the
    chain rule, we can decompose the joint
    probability (see the reconstruction below).
  • Problem: the probability of the next word depends
    on the entire past history, which is impossible
    to estimate.
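
The chain-rule decomposition referred to above is the standard one, with w1, ..., wn denoting the words of W:

```latex
P(W) = P(w_1, w_2, \ldots, w_n)
     = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1 w_2)
       \cdots P(w_n \mid w_1 \cdots w_{n-1})
```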

5
N-gram models
  • Markov assumption: the probability of a word
    depends only on a limited history.
  • N-gram model: the probability of a word depends
    on the previous N-1 words.
  • unigrams, bigrams, trigrams, 4-grams, ...
  • The higher N is, the more data is needed for
    training.
  • For trigrams, then, the probability of a sequence
    is just the product of the conditional
    probabilities of its trigrams, as reconstructed
    below.
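
Under the third-order Markov assumption, this product takes the standard form (with two start symbols padding the first histories):

```latex
P(w_1 \cdots w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-2},\, w_{i-1})
```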

6
N-GRAMS (cont.)
  • N: the order of the N-gram model
  • Unigrams (order 1)
  • Bigrams (order 2)
  • Trigrams (order 3)
  • Quadrigrams (order 4)

7
Smoothing
  • Determine probabilities by counting the number of
    times wi occurs after a sequence, normalising by
    how often that sequence occurs. For trigrams, see
    the count-ratio formula below.
  • The data sparsity problem: the training data is
    not large enough to provide all of this
    information, so unseen sequences get probability
    0.
  • N-gram smoothing assigns some probability to
    unseen data; the most often used method is the
    back-off model.
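
The trigram count ratio, plus a Katz-style back-off sketch. The discount d and back-off weight alpha are generic placeholders; the slide does not pin down the exact scheme:

```latex
% maximum-likelihood trigram estimate (count ratio)
P(w_i \mid w_{i-2}, w_{i-1}) =
  \frac{C(w_{i-2}\, w_{i-1}\, w_i)}{C(w_{i-2}\, w_{i-1})}

% back-off: discount seen trigrams, fall back to bigrams otherwise
P_{\mathrm{bo}}(w_i \mid w_{i-2}, w_{i-1}) =
  \begin{cases}
    d \cdot \frac{C(w_{i-2}\, w_{i-1}\, w_i)}{C(w_{i-2}\, w_{i-1})}
      & \text{if } C(w_{i-2}\, w_{i-1}\, w_i) > 0 \\
    \alpha(w_{i-2}, w_{i-1})\, P_{\mathrm{bo}}(w_i \mid w_{i-1})
      & \text{otherwise}
  \end{cases}
```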

8
Different Discounting Methods
  • Absolute discounting (ABS)
  • Linear discounting (LIN)
  • Good-Turing discounting (GT)
  • Witten-Bell discounting (WB)
  • Linear-Interpolation (INTER) Model
  • Nr: the number of n-grams that occur r times
    (used in the Good-Turing formula below)
  • N: the total number of events (unigrams)
  • C: the number of distinct words that follow
    wi-2, wi-1
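
With this notation, the standard Good-Turing adjusted count for an n-gram seen r times is:

```latex
r^{*} = (r + 1)\, \frac{N_{r+1}}{N_r}
```

Witten-Bell, by comparison, reserves a probability mass proportional to C, the number of distinct continuations of the history, for unseen words.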

9
Measurement Methods
  • Given a language model, the perplexity of a
    sequence W is the geometric-average inverse
    probability (reconstructed below).
  • Remarkable fact: the true model for the data has
    the lowest possible perplexity. The lower the
    perplexity, the closer we are to the true model.
  • The empirical entropy of the model on W is also
    given below.
  • Remarkable fact: entropy is the average number of
    bits per word required to encode the test data
    using this probability model and an optimal
    coder. It is measured in bits.
  • Strictly, it should be called the cross-entropy
    of the model on the test data. The lower the
    cross-entropy, the closer we are to the true
    model.
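
Both quantities have standard definitions for a test sequence W = w1 ... wN:

```latex
PP(W) = P(w_1 \cdots w_N)^{-1/N}
\qquad
H(W) = -\frac{1}{N} \log_2 P(w_1 \cdots w_N)
\qquad
PP(W) = 2^{H(W)}
```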

10
Session Identification with N-gram Model
  • Assumption: given a consecutive sequence
    W = (w1, ..., wN), if W crosses a session
    boundary, P(W) is smaller; otherwise, P(W) is
    larger.
  • Problem: probability is not a good measure. Is a
    probability value of 0.05 small or large?
  • Use the cross-entropy value instead. Entropy
    values usually lie between 0 and 10. A larger
    probability means a smaller entropy value.

11
Session Identification with N-gram Model(2)
  • Session detection: the entropy value suddenly
    increases at a session boundary. A sketch of this
    rule follows.
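
A minimal sketch of this rule, assuming preprocessing has already produced a template-label sequence. The add-k bigram model and the running-prefix entropy are deliberate simplifications standing in for the paper's smoothed N-gram model and per-sequence entropy; `threshold` is the relative entropy difference (e.g. the 0.16 reported later for the development set):

```python
import math
from collections import defaultdict

def train_bigram(sequence, vocab_size, k=1.0):
    """Add-k smoothed bigram model over template labels (a simple
    stand-in for the paper's smoothed N-gram model)."""
    bigrams, unigrams = defaultdict(int), defaultdict(int)
    for prev, cur in zip(sequence, sequence[1:]):
        bigrams[(prev, cur)] += 1
        unigrams[prev] += 1

    def logprob(prev, cur):
        # log2 P(cur | prev) with add-k smoothing
        return math.log2((bigrams[(prev, cur)] + k)
                         / (unigrams[prev] + k * vocab_size))

    return logprob

def detect_boundaries(events, logprob, threshold):
    """Flag a session boundary wherever the running per-event
    cross-entropy jumps by more than `threshold` (relative)."""
    boundaries, total, prev_h = [], 0.0, None
    for i in range(1, len(events)):
        total += -logprob(events[i - 1], events[i])
        h = total / i                  # cross-entropy so far, bits/event
        if prev_h is not None and (h - prev_h) / prev_h > threshold:
            boundaries.append(i)       # event i likely opens a new session
        prev_h = h
    return boundaries

# seq = ["T0", "T1", "T0", ...]       # label sequence from preprocessing
# lp = train_bigram(seq, vocab_size=len(set(seq)))
# detect_boundaries(seq, lp, threshold=0.16)
```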

12
Performance Measurement Metrics
  • The correct sessions for all test data are known.
    We compare the estimated sessions with the
    correct sessions.
  • We use both hit and false-positive rates to
    measure the accuracy of our session detection
    algorithms.
  • To better explain the trade-offs between hits and
    false-positives, we employ the F-measure. This
    standard comparison metric, the harmonic mean of
    precision and recall, is defined as
    F = 2 · Precision · Recall / (Precision + Recall),
  • where Recall is the hit-rate
    (hit_sessions / total_sessions) and Precision is
    the ratio of correct hits to proposed hits
    (hit_sessions / estimated_sessions).
  • Higher F-measures indicate better overall
    performance.

13
Data Description
  • Target: a client/server application whose DBMS is
    SQL Server 7.0. Database traces are collected
    using SQL Profiler.
  • Training data: the N-gram probabilities come from
    training data.
  • Overly narrow data: probabilities don't
    generalize.
  • Overly general data: probabilities don't reflect
    the task or domain.
  • We collected two training data sets. Data set
    train1 contains 9 database connections and 1,900
    events within a 6-hour observation period. Data
    set train2 contains 18 database connections and
    7,244 events within a 10-hour observation period.

14
Data Description (2)
  • Test data: a separate test data set is used to
    evaluate the model, typically using standard
    metrics.
  • Development test set: a test set used to choose
    the best N-gram model and threshold value; the
    trained N-gram model can then be used to separate
    other test data.
  • We collected five test data sets, referred to as
    d1, d2, ..., d5. d1, d2, d3 and d4 have very
    similar behavior, while d5 contains one distinct
    kind of task or session S, which is always
    followed by itself in both the training data and
    the test data (such as S, S, ..., S). It is hard
    to detect such sessions using the N-gram model.

15
Data Collection and Preprocessing
  • The collected workload contains the following
    fields: starttime, endtime, spid (connection id),
    and query.
  • Preprocessing the workload:
  • User identification: use the spid to identify
    users.
  • Query clustering and classification:
  • Separate the data values from each query to
    obtain a query template.
  • Each unique query template is assigned a label.
  • Cluster queries according to the corresponding
    query template, and replace each query field with
    the corresponding label.
  • The output is a sequence of template labels; a
    sketch follows.
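
A sketch of the templating and labeling steps. The regexes are illustrative, not the paper's actual parsing rules; they cover only quoted strings and bare numbers:

```python
import re

def to_template(query):
    """Strip data values (string and numeric literals) from a query
    to obtain its template."""
    q = re.sub(r"'[^']*'", "?", query)       # string literals
    q = re.sub(r"\b\d+(\.\d+)?\b", "?", q)   # numeric literals
    return re.sub(r"\s+", " ", q).strip().lower()

def label_queries(queries):
    """Assign one label per unique template and emit the label sequence."""
    labels, sequence = {}, []
    for query in queries:
        template = to_template(query)
        labels.setdefault(template, "T%d" % len(labels))
        sequence.append(labels[template])
    return sequence, labels

# label_queries(["SELECT * FROM t WHERE id = 42",
#                "SELECT * FROM t WHERE id = 7"])
# -> (["T0", "T0"], {"select * from t where id = ?": "T0"})
```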

16
Data Collection and Preprocessing (2)
17
Performance of Timeout Method
  • We conducted experiments with a number of timeout
    thresholds, namely 0.2, 0.5, 1 to 8, 10, 12, and
    20 seconds.
  • The performance of the standard timeout session
    detection method obviously depends on the timeout
    threshold.
  • Different applications may have different best
    timeout thresholds.
  • In this particular application, a threshold value
    between 3 and 10 seconds is good. The best
    precision value is around 70%. A sketch of the
    baseline follows.
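
The baseline itself is a one-pass gap check; a sketch, with `start_times` as the per-event start timestamps in seconds (a hypothetical input derived from the trace's starttime field):

```python
def timeout_sessions(start_times, threshold_s):
    """Start a new session whenever the think-time between two
    consecutive events exceeds the timeout threshold (seconds)."""
    if not start_times:
        return []
    sessions, current = [], [0]
    for i in range(1, len(start_times)):
        if start_times[i] - start_times[i - 1] > threshold_s:
            sessions.append(current)   # gap too long: close the session
            current = []
        current.append(i)
    sessions.append(current)
    return sessions

# timeout_sessions([0.0, 1.2, 2.0, 9.5, 10.1], 5.0)
# -> [[0, 1, 2], [3, 4]]
```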

18
Comparison of Timeout Method and N-Gram Method
We compare the best performance of the timeout
method and the N-gram method. The results show
that the N-gram method outperforms the timeout
method by a range of 5% to 43%.
19
Development Set
  • We choose test data d1 as our development set.
  • Experimental results show that the 3-gram model
    with the Witten-Bell smoothing method achieves
    the best performance, with a threshold value of
    0.16.
  • We use the trained model on the other test sets.
    We observe that the trained model can detect
    session boundaries successfully for test data d2,
    d3, and d4, but not for d5, since d5 contains a
    different kind of task that is not observed in
    d1.

20
Comparison of Different Train Data
We compare the best performance of the N-gram
method based on different training data. The
results show that train2 has better performance
than train1 for every test data set.
21
Comparison of Different Smoothing Methods
  • Results show that Witten-Bell discounting (WB),
    Linear-Interpolation (INTER) and Linear
    discounting (LIN) are better than Absolute
    discounting (ABS) and Good-Turing discounting
    (GT).
  • From the entropy evolution curves shown on the
    right of the slide, it can be observed that the
    curves for WB, LIN and INTER are sharper than
    those of ABS and GT.

22
Comparison of Different N-gram Orders
Results show that N-gram methods with orders
between 2 and 8 are generally good, and the
performance of methods with a lower order (2 to
4) is better than that of methods with a higher
order (5 to 8). The best N-gram orders are
usually between 2 and 4.
23
Comparison of Different Threshold Values
  • Threshold selection is a critical task for the
    N-gram model based session boundary detection
    method.
  • If the selected threshold value is too small,
    many non-boundary events are treated as session
    boundaries, and the precision rate is low.
  • If the threshold value is too large, many session
    boundaries are missed, and the recall is low.
  • Results show that a threshold value between 5%
    and 20% is suitable for our language model.

24
Automatic Threshold Selection
  • Suppose the test data has m sessions and n
    events. After we estimate the entropy value for
    each event in the test data, we can sort the
    relative entropy difference values in decreasing
    order.
  • If our language model can find all m session
    boundaries correctly, then the corresponding
    entropy differences will occupy the first m
    positions in the sorted list. Thus, the m-th
    value in the sorted list is the estimated
    threshold value (see the sketch below).
  • The results show that the performance with the
    estimated threshold differs only slightly from
    the best performance.
  • Remark: the estimated threshold values for
    different smoothing methods or different N-gram
    orders are different.
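
This selection rule is a one-liner; a direct transcription, where `rel_diffs` holds the relative entropy difference at each event and `m` is the session count (both names are hypothetical):

```python
def estimate_threshold(rel_diffs, m):
    """If the model ranks all m true boundaries ahead of all
    non-boundaries, the m-th largest relative entropy difference
    is exactly the threshold that recovers m sessions."""
    assert 1 <= m <= len(rel_diffs)
    return sorted(rel_diffs, reverse=True)[m - 1]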

25
Question: different detection formulas?
  • We propose six session detection formulas:
  • Absolute: use the absolute entropy difference.
  • Relative: use the relative entropy difference
    (our standard formula; see the reading below).
  • Forward-1: take E(wi+1) into consideration; the
    relative entropy difference between E(wi+1) and
    E(wi-1) must also be larger than the threshold.
  • Forward-N: take E(wi+1), ..., E(wi+N) into
    consideration; all relative entropy differences
    must be larger than the threshold.
  • Mix-1 and Mix-2: each event has two entropy
    values for two different N-gram orders n1 and n2.
    Mix-1 requires each entropy difference to be
    larger than the threshold, whereas Mix-2 requires
    their sum to be larger than the threshold. We
    choose n1 = 2 and n2 = 3, 4, 5, 6.
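
One plausible reading of the two basic formulas; the threshold symbol was garbled in the transcript, so θ stands in for it, and E(wi) denotes the entropy at event i:

```latex
\text{Absolute: } E(w_i) - E(w_{i-1}) > \theta
\qquad
\text{Relative: } \frac{E(w_i) - E(w_{i-1})}{E(w_{i-1})} > \theta
```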

26
Question: different detection methods?
  • We compare the best performance of these
    detection methods. The results show that no
    single method is absolutely better than the
    others, but the following performance relation
    usually holds:
  • Mix-1, Forward-1 > Mix-2, Relative > Forward-N,
    Absolute.
  • Method Mix-1 has an average improvement of 3.85%,
    and Forward-1 has an average improvement of
    1.97%. But Forward-1 and Forward-N need to know
    the successor events, which is implausible for
    online session detection, and the threshold value
    of methods Mix-1 and Mix-2 is not as
    straightforward.

27
Question: usage of domain knowledge?
  • Data value: we can use the data values associated
    with the query template to separate sessions, or
    as a supplement to other session boundary
    separation methods. The disadvantage of this
    approach is that it is implausible to deploy an
    automatic procedure to do so.
  • Boundary word: in the database field, some
    representative boundary words are connection
    opening/closing, transaction
    beginning/rollback/commit, and user authority
    checking.
  • Boundary words can be used in the separating
    step, in which a new session is created whenever
    a boundary word is found. The boundary words can
    also be used in the training step.
  • The N-gram model based on the un-separated
    request sequence contains both the inter-session
    and the intra-session request frequencies.
  • We can pre-separate the request sequence
    according to the boundary words and use the
    pre-separated sequences as the training data, in
    which some inter-session request frequencies are
    set to 0 (see the sketch below).
  • The corresponding N-gram model is more accurate
    than one trained on the un-separated sequence.
  • A special case is to use correctly separated
    session data as the training data.
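
A sketch of the pre-separation step, with `boundary_labels` as a hypothetical, application-specific set of boundary-word template labels:

```python
def pre_separate(sequence, boundary_labels):
    """Split the request sequence at boundary words so inter-session
    bigrams never enter the training counts."""
    sessions, current = [], []
    for label in sequence:
        if label in boundary_labels and current:
            sessions.append(current)   # boundary word opens a new session
            current = []
        current.append(label)
    if current:
        sessions.append(current)
    return sessions

# pre_separate(["OPEN", "T1", "T2", "OPEN", "T3"], {"OPEN"})
# -> [["OPEN", "T1", "T2"], ["OPEN", "T3"]]
```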

28
Question usage of domain knowledge?
  • In the OLTP application, we found 3 boundary
    words with the help of a domain expert. We
    pre-separate training data train2 by using the
    boundary words; the result is referred to as
    train2". The correctly separated training data is
    referred to as train2'. During the separating
    step, we have the option of using boundary words
    or not.
  • The results show that using boundary words only
    during the separating step gives improvements of
    22% and 18% for the timeout method and the N-gram
    method, respectively. Training data train2" and
    train2' give improvements of 20% and 32%
    respectively, while combining them with boundary
    words gives improvements of 21% and 33%.
  • The model with the correctly separated training
    data and boundary words achieves the best
    performance, reaching 98.6% accuracy for d5.

29
More: different measurement metrics?
  • Method1: an alternative measurement to the
    F-measure.
  • Method2: takes session length and matched length
    into consideration.
  • Cross-entropy: use the cross-entropy value for
    self-measurement.
  • Both Method1 and Method2 are good performance
    measurements. A smaller cross-entropy value means
    a better F-measure.

30
Conclusion
  • The N-gram model based method is better than the
    timeout method.
  • An N-gram model with order 2 to 4 is usually
    good, and a threshold value between 5% and 20% is
    usually good.
  • The method can be applied without user
    identification (assuming the events of the same
    session use the same connection).
  • Domain knowledge can improve the performance
    greatly.