Title: Dynamic Database User Session Identification with Statistical Language Model
1. Dynamic Database User Session Identification with Statistical Language Model
- By Qingsong Yao, Xiangji Huang and Aijun An
- qingsong, aan_at_cs.yorku.ca, jhuang_at_yorku.ca
- York University
- Toronto, Canada
2. Outline
- Introduction
- Statistical Language Model (N-gram)
- Session Identification with N-gram Model
- Data Description
- Experimental Result
- Conclusion
3. Introduction
- Database users typically submit a sequence of queries to perform a certain task; such a sequence is called a database user session.
- Separating database sessions from a database trace or workload is a critical task.
- The traditional session separation method is the timeout method, which assumes that the think-time between two consecutive user sessions is longer than the think-time between two consecutive events within a session. This assumption does not always hold.
- In this presentation, we illustrate the idea of using a statistical language model, referred to as the N-gram model, to separate sessions. Our method is based on statistical information about the requests of database users.
4. Statistical Language Model (N-gram)
- What is the probability of a string W? By the chain rule we can decompose the joint probability:
  P(W) = P(w1) P(w2 | w1) P(w3 | w1, w2) ... P(wn | w1, ..., wn-1)
- Problem: the probability of the next word depends on the entire past history, which is impossible to estimate.
5. N-gram Models
- Markov assumption: the probability of a word depends only on a limited history.
- N-gram model: the probability of a word depends on the previous N-1 words.
- unigrams, bigrams, trigrams, 4-grams, ...
- The higher N is, the more data is needed for training.
- For trigrams, then, the probability of a sequence is just the product of the conditional probabilities of its trigrams:
  P(W) = product over i of P(wi | wi-2, wi-1)
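As a rough illustration, trigram training and scoring can be sketched in a few lines of Python. This is not the authors' implementation; token lists stand in for the query-label sequences used later in the talk, and all names are invented:

```python
from collections import defaultdict

def train_trigram(corpus):
    """Count trigrams and their two-word contexts from a list of
    token sequences, padding each sequence with start symbols."""
    tri, ctx = defaultdict(int), defaultdict(int)
    for seq in corpus:
        padded = ["<s>", "<s>"] + list(seq)
        for i in range(2, len(padded)):
            context = (padded[i - 2], padded[i - 1])
            tri[context + (padded[i],)] += 1
            ctx[context] += 1
    return tri, ctx

def trigram_prob(seq, tri, ctx):
    """P(W) as the product of conditional trigram probabilities
    P(wi | wi-2, wi-1), using unsmoothed maximum-likelihood estimates."""
    p = 1.0
    padded = ["<s>", "<s>"] + list(seq)
    for i in range(2, len(padded)):
        context = (padded[i - 2], padded[i - 1])
        c = tri.get(context + (padded[i],), 0)
        n = ctx.get(context, 0)
        if n == 0 or c == 0:
            return 0.0  # unseen trigram gets zero probability without smoothing
        p *= c / n
    return p
```

The zero return for unseen trigrams is exactly the sparsity problem that the smoothing slides below address.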
6. N-Grams (cont.)
- N: the order of the N-gram model
- Unigrams (order 1): P(wi)
- Bigrams (order 2): P(wi | wi-1)
- Trigrams (order 3): P(wi | wi-2, wi-1)
- Quadrigrams (order 4): P(wi | wi-3, wi-2, wi-1)
7. Smoothing
- Determine probabilities by counting how many times wi occurs after a sequence, normalising by how often that sequence occurs. For trigrams:
  P(wi | wi-2, wi-1) = C(wi-2, wi-1, wi) / C(wi-2, wi-1)
- The data sparsity problem: the training data is not big enough to provide all this information, so unseen sequences get zero probability.
- N-gram smoothing assigns some probability to unseen data; the most often used method is the backing-off model.
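The backing-off idea can be sketched as follows. This is a simplified sketch, not the exact model used in the paper: a full Katz back-off also discounts the seen estimates and renormalises the mass given to lower orders.

```python
from collections import Counter

def count_ngrams(tokens):
    """Unigram, bigram and trigram counts for one token stream."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    return uni, bi, tri

def backoff_prob(w1, w2, w3, uni, bi, tri):
    """Estimate P(w3 | w1, w2): use the trigram ML estimate when the
    trigram was seen, otherwise back off to the bigram, then to a
    smoothed unigram relative frequency."""
    if tri[(w1, w2, w3)] > 0:
        return tri[(w1, w2, w3)] / bi[(w1, w2)]
    if bi[(w2, w3)] > 0:
        return bi[(w2, w3)] / uni[w2]
    total = sum(uni.values())
    return (uni[w3] + 1) / (total + len(uni))  # add-one floor for unseen words
```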
8. Different Discounting Methods
- Absolute discounting (ABS)
- Linear discounting (LIN)
- Good-Turing discounting (GT)
- Witten-Bell discounting (WB)
- Linear-Interpolation (INTER) model
- Nr: the number of N-grams that occur r times
- N: the number of events (unigrams)
- C: the number of distinct words that follow wi-2, wi-1
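Of the discounting methods listed, Witten-Bell is the one the experiments later favour. A minimal sketch of the standard Witten-Bell estimate (textbook form, not necessarily the exact variant used in the paper):

```python
from collections import Counter

def witten_bell_prob(w, follower_counts, vocab_size):
    """Witten-Bell estimate of P(w | context), given the counts of words
    observed after that context. T distinct followers were seen a total
    of n times; the T/(n+T) probability mass reserved for unseen events
    is shared equally among the Z words never seen after the context."""
    n = sum(follower_counts.values())  # total continuations observed
    t = len(follower_counts)           # distinct continuation types
    z = vocab_size - t                 # words never seen after this context
    if follower_counts[w] > 0:
        return follower_counts[w] / (n + t)
    return t / (z * (n + t)) if z > 0 else 0.0
```

With followers {a: 3, b: 1} and a vocabulary of 4 words, the four probabilities sum to 1, with the two unseen words sharing the reserved mass.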
9. Measurement Methods
- Given a language model, the perplexity of a sequence W is the geometric-average inverse probability:
  PP(W) = P(W)^(-1/N)
- Remarkable fact: the true model for the data has the lowest possible perplexity. The lower the perplexity, the closer we are to the true model.
- The empirical entropy of the model on W is:
  H(W) = -(1/N) log2 P(W)
- Remarkable fact: the entropy is the average number of bits per word required to encode the test data using this probability model and an optimal coder; it is measured in bits.
- Strictly, it should be called the cross-entropy of the model on the test data. The lower the cross-entropy, the closer we are to the true model.
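The two quantities above are direct to compute once the model supplies per-word conditional probabilities (a small sketch; variable names are illustrative):

```python
import math

def cross_entropy(probs):
    """Empirical cross-entropy in bits per word:
    H = -(1/N) * sum_i log2 P(wi | history)."""
    return -sum(math.log2(p) for p in probs) / len(probs)

def perplexity(probs):
    """Perplexity is the geometric-average inverse probability,
    equivalently 2 raised to the cross-entropy."""
    return 2 ** cross_entropy(probs)
```

For example, a uniform model over four outcomes assigns each word probability 0.25, giving a cross-entropy of 2 bits and a perplexity of 4.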
10. Session Identification with the N-gram Model
- Assumption: given a consecutive sequence W = (w1, ..., wN), if W crosses a session boundary, P(W) is smaller; otherwise, P(W) is larger.
- Problem: probability is not a good measure. Is a probability value of 0.05 small or large?
- Use the cross-entropy value instead. The entropy value is usually between 0 and 10, and a larger probability means a smaller entropy value.
11. Session Identification with the N-gram Model (2)
- Session detection: the entropy value suddenly increases at a session boundary.
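The detection rule can be sketched as a scan for entropy spikes. This assumes the relative difference (E_i - E_{i-1}) / E_{i-1} as the boundary score, matching the "Relative" formula described later in the talk; the paper's exact definition may differ in detail:

```python
def detect_boundaries(entropies, threshold):
    """Flag a session boundary before event i when the relative
    increase in per-event entropy exceeds the threshold."""
    boundaries = []
    for i in range(1, len(entropies)):
        prev, cur = entropies[i - 1], entropies[i]
        if prev > 0 and (cur - prev) / prev > threshold:
            boundaries.append(i)
    return boundaries
```

With entropies [1.0, 1.1, 3.0, 1.0] and a threshold of 0.5, only the jump from 1.1 to 3.0 is flagged, so event 2 starts a new session.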
12. Performance Measurement Metric
- The correct sessions for all test data are known. We compare the estimated sessions with the correct sessions.
- We use both hit and false-positive rates to measure the accuracy of our session detection algorithms.
- To better explain the trade-offs between hits and false-positives, we employ the F-measure. This standard comparison metric, the harmonic mean of precision and recall, is defined as:
  F = 2 * Precision * Recall / (Precision + Recall)
- where Recall is the hit rate (hit_sessions / total_sessions) and Precision is the ratio of correct hits to proposed hits (hit_sessions / estimated_sessions).
- Higher F-measures indicate better overall performance.
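The metric above reduces to a few lines (argument names are ours, chosen to mirror the slide's definitions):

```python
def f_measure(hit_sessions, estimated_sessions, total_sessions):
    """F = 2PR / (P + R), with Recall = hits / total sessions and
    Precision = hits / estimated sessions."""
    recall = hit_sessions / total_sessions
    precision = hit_sessions / estimated_sessions
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, 8 correct hits out of 10 proposed sessions against 10 true sessions gives precision 0.8, recall 0.8, and F = 0.8.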
13. Data Description
- Target: a client/server application whose DBMS is SQL Server 7.0. Database traces are collected using SQL Profiler.
- Training data: the N-gram probabilities come from training data.
- Overly narrow data: the probabilities don't generalize.
- Overly general data: the probabilities don't reflect the task or domain.
- We collected two training data sets. Data train1 contains 9 database connections and 1,900 events within a 6-hour observation time. Data train2 contains 18 database connections and 7,244 events within a 10-hour observation time.
14. Data Description (2)
- Test data: a separate test data set is used to evaluate the model, typically using standard metrics.
- Development test set: a kind of test set used to choose the best N-gram model and threshold value; the trained N-gram model can then be used to separate other test data.
- We collected five test data sets, referred to as d1, d2, ..., d5. d1, d2, d3 and d4 have very similar behavior, while d5 contains one distinct kind of task or session S that is always followed by itself in both the training data and the test data, such as S, S, ..., S. It is hard to detect such sessions using the N-gram model.
15. Data Collection and Preprocessing
- The collected workload contains the following fields: starttime, endtime, spid (connection id), and query.
- Preprocessing the workload:
- User identification: use the spid to identify users.
- Query clustering and classification:
- Separate the data values from each query to obtain the query template.
- Assign each unique query template a label.
- Cluster queries according to the corresponding query template, and replace each query field with the corresponding label.
- The output is a sequence of template labels.
16. Data Collection and Preprocessing (cont.)
17. Performance of the Timeout Method
- We conducted experiments with a number of timeout thresholds, namely 0.2, 0.5, 1 to 8, 10, 12, and 20 seconds.
- The performance of the standard timeout session detection method obviously depends on the timeout threshold.
- Different applications may have different best timeout thresholds.
- In this particular application, a threshold value between 3 and 10 seconds is good. The best precision value is around 70%.
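The timeout baseline itself is simple to state in code (a sketch; event representation is our choice):

```python
def timeout_sessions(events, threshold_sec):
    """Classic timeout separation: start a new session whenever the
    think-time gap between consecutive events exceeds the threshold.
    Each event is a (start_time_sec, query_label) pair."""
    sessions, current = [], []
    last_time = None
    for t, label in events:
        if last_time is not None and t - last_time > threshold_sec:
            sessions.append(current)
            current = []
        current.append(label)
        last_time = t
    if current:
        sessions.append(current)
    return sessions
```

A 19-second gap with a 5-second threshold splits the stream into two sessions, regardless of what the queries actually were, which is exactly the weakness the N-gram method targets.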
18. Comparison of the Timeout Method and the N-Gram Method
We compare the best performance of the timeout method and the N-gram method. The result shows that the N-gram method outperforms the timeout method by a margin of 5% to 43%.
19. Development Set
- We choose test data d1 as our development set.
- Experimental results show that the 3-gram model with the Witten-Bell smoothing method achieves the best performance when the threshold value is 0.16.
- We use the trained model on the other test sets. We observe that the trained model can detect session boundaries for test data d2, d3 and d4 successfully, but not for d5, since d5 contains a different kind of task that is not observed in d1.
20. Comparison of Different Training Data
We compare the best performance of the N-gram method based on the different training data. The result shows that train2 has better performance than train1 for every test data set.
21. Comparison of Different Smoothing Methods
- Results show that Witten-Bell discounting (WB), Linear-Interpolation (INTER) and Linear discounting (LIN) are better than Absolute discounting (ABS) and Good-Turing discounting (GT).
- From the entropy evolution curves shown on the right side, it can be observed that the curves for WB, LIN and INTER are sharper than those of ABS and GT.
22. Comparison of Different N-Gram Orders
Results show that an N-gram method with order between 2 and 8 is generally good, and the performance of methods with a lower order (2 to 4) is better than that with a higher order (5 to 8). The best N-gram orders are usually between 2 and 4.
23. Comparison of Different Threshold Values
- Threshold selection is a critical task in the N-gram-model-based session boundary detection method.
- If the selected threshold value is too small, many non-boundary events are treated as session boundaries, and the precision rate is small.
- A large threshold value causes many session boundaries to be missed, and the recall is small.
- Results show that a threshold value between 5% and 20% is suitable for our language model.
24. Automatic Threshold Selection
- Suppose the test data has m sessions and n events. After we estimate the entropy value for each event in the test data, we can sort the relative entropy difference values in decreasing order.
- If our language model can find all m session boundaries correctly, the corresponding entropy differences will occupy the first m positions in the sorted list. Thus, the m-th value in the sorted list is the estimated threshold value.
- The results show that the performance with the estimated threshold differs only slightly from the best performance.
- Remark: the estimated threshold values for different smoothing methods or different N-gram orders are different.
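The selection rule described above amounts to one sort and one index (a sketch; the relative-entropy-difference values are assumed to be precomputed):

```python
def estimate_threshold(entropy_diffs, m):
    """If the model ranks the m true session boundaries first, the m-th
    largest relative entropy difference is a natural threshold estimate."""
    ranked = sorted(entropy_diffs, reverse=True)
    return ranked[m - 1]
```

For differences [0.1, 0.9, 0.3, 0.7] and m = 2 true sessions, the estimate is 0.7: a cut there keeps exactly the top two candidates as boundaries.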
25. Question: different detection formulas?
- We propose six session detection formulas:
- Absolute: use the absolute entropy difference.
- Relative: use the relative entropy difference (our standard formula).
- Forward-1: take E(wi+1) into consideration; the relative entropy difference between E(wi+1) and E(wi) must also be larger than the threshold.
- Forward-N: take E(wi+1), ..., E(wi+N) into consideration; all relative entropy differences must be larger than the threshold.
- Mix-1 and Mix-2: each event has two entropy values for two different N-gram orders n1 and n2. Mix-1 requires each entropy difference to be larger than the threshold, whereas Mix-2 requires their sum to be larger than the threshold. We choose n1 = 2 and n2 = 3, 4, 5, 6.
26. Question: different detection methods?
- We compare the best performance of these detection methods. The result shows that no one method is absolutely better than the others, but the following performance relation usually holds:
- Mix-1, Forward-1 > Mix-2, Relative > Forward-N, Absolute.
- Method Mix-1 has an average improvement of 3.85%, and Forward-1 has an average improvement of 1.97%. But Forward-1 and Forward-N need to know the successor events, which is implausible for online session detection, and the threshold values for methods Mix-1 and Mix-2 are not so straightforward.
27. Question: usage of domain knowledge?
- Data values. We can use the data values associated with the query templates to separate sessions, or as a supplement to other session boundary separation methods. The disadvantage of this approach is that it is implausible to deploy an automatic procedure to do it.
- Boundary words. In the database field, some representative boundary words are connection opening/closing, transaction begin/rollback/commit, and user authority checking.
- Boundary words can be used in the separating step, in which a new session is created whenever a boundary word is found. They can also be used in the training step.
- The N-gram model based on the un-separated request sequence contains both the inter-session and the intra-session request frequencies.
- We can pre-separate the request sequence according to the boundary words and use the pre-separated sequences as the training data, in which some inter-session request frequencies are set to 0.
- The corresponding N-gram model is more accurate than the one trained on the un-separated sequence.
- A special case is using correctly separated training session data as the training data.
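The pre-separation step described above can be sketched as a simple split on boundary labels (names are illustrative):

```python
def pre_separate(labels, boundary_words):
    """Split a request-label sequence at boundary words (e.g. connection
    open/close, transaction begin/commit) so that training counts no
    longer mix inter-session and intra-session frequencies."""
    sessions, current = [], []
    for lab in labels:
        if lab in boundary_words and current:
            sessions.append(current)
            current = []
        current.append(lab)
    if current:
        sessions.append(current)
    return sessions
```

Training the N-gram model on each resulting sub-sequence separately means no trigram ever spans a known boundary, which is exactly how the inter-session frequencies end up at 0.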
28. Question: usage of domain knowledge? (2)
- In the OLTP application, we found 3 boundary words with the help of a domain expert. We pre-separate training data train2 using the boundary words; the result is referred to as train2". The correctly separated training data is train2'. During the separating step, we have the option of using boundary words or not.
- The result shows that using boundary words only during the separating step gives an improvement of 22% and 18% for the timeout method and the N-gram method, respectively. Training data train2" and train2' give improvements of 20% and 32% respectively, while combining them with boundary words gives improvements of 21% and 33%.
- The model with the correctly separated training data and boundary words achieves the best performance, reaching 98.6% accuracy for d5.
29. More: different measurement metrics?
- Method 1: an alternative measurement to the F-measure.
- Method 2: take the session length and matched length into consideration.
- Cross-entropy: use the cross-entropy value for self-measurement.
- Both are good performance measurements. A smaller cross-entropy value means a better F-measure.
30. Conclusion
- The N-gram-model-based method is better than the timeout method.
- An N-gram model with order 2 to 4 is usually good, and a threshold value between 5% and 20% is usually good.
- The method can be applied without user identification (assuming the events of the same session use the same connection).
- Domain knowledge can improve the performance greatly.