Mining di Dati Web - PowerPoint PPT Presentation

About This Presentation

Title:

Mining di Dati Web

Description:

Each component of a WSE records information about its operations. ... Belkin, N.J., et al. 'Rutgers' TREC 2001 Interactive Track Experience', in ... – PowerPoint PPT presentation

Slides: 55

Provided by: fabrizios7

Transcript and Presenter's Notes

Title: Mining di Dati Web

1
Mining di Dati Web (Web Data Mining)

Web Search Engines: Query Log Mining
Academic Year 2006/2007

2
What's Recorded in a WSE Query Log?

Each component of a WSE records information about
its operations.
We are mainly concerned with frontend logs.
They record each query submitted to the WSE.

3
Data Recorded

Among other information, WSEs record:
The query topic.
The first result wanted.
The number of results wanted.
Some examples:
q(fabrizio silvestri)f(1)n(10)
q(information retrieval)f(5)n(15)
Some other information:
The language.
Results folded? (Y/N).
Etc.
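
As an illustration of the record layout in the examples above, here is a minimal sketch of a parser. It assumes every record has exactly the q(...)f(...)n(...) shape shown on the slide; the names QueryRecord and parse_record are hypothetical, and real frontend logs carry more fields (language, folding flag, and so on) that this sketch ignores.

```python
import re
from typing import NamedTuple


class QueryRecord(NamedTuple):
    query: str  # query text, the q(...) field
    first: int  # rank of the first result wanted, the f(...) field
    count: int  # number of results wanted, the n(...) field


# Matches only the q(...)f(...)n(...) layout used in the slide examples.
_RECORD = re.compile(r"q\((?P<q>[^)]*)\)f\((?P<f>\d+)\)n\((?P<n>\d+)\)")


def parse_record(line: str) -> QueryRecord:
    """Parse one log record, raising ValueError on any other layout."""
    match = _RECORD.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"unrecognized log record: {line!r}")
    return QueryRecord(match["q"], int(match["f"]), int(match["n"]))


record = parse_record("q(fabrizio silvestri)f(1)n(10)")
```

Here record.query holds the query text, while record.first and record.count hold the two integers.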

Commonly referred to as the query
4
What Can We Look For?

The most popular queries.
How queries are distributed.
How queries are related.
How subsequent queries are related.
How topics are distributed.
How topics change throughout the 24 hours.
Can we exploit this information?
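
The first two questions can be approached with simple counting. The sketch below is only an illustration: the tiny log is made up, the helper name query_popularity is hypothetical, and lower-casing is just one possible normalization.

```python
from collections import Counter


def query_popularity(queries):
    """Count how often each (whitespace-stripped, lower-cased) query occurs."""
    return Counter(q.strip().lower() for q in queries)


# Made-up log, used only to exercise the function.
log = [
    "Pisa", "information retrieval", "pisa",
    "web mining", "pisa", "information retrieval",
]

counts = query_popularity(log)
top_two = counts.most_common(2)  # the most popular queries
singletons = sum(1 for c in counts.values() if c == 1)  # queries seen only once
```

The full counts table is the raw material for studying how queries are distributed, for example by sorting the frequencies and plotting them against rank.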

5
Let's Start Looking at Some Interesting Items

What are the most popular queries?

6
Most Popular Topics
7
Most Popular Terms
8
What Are Users Doing?

Not typing many words!
Average query was 2.6 words long (in 2001), up
from 2.4 words in 1997.
Moving toward e-commerce
Less sex (down from 17% to 9%), more business (up
from 13% to 25%).
Spink A., et al. From e-Sex to e-Commerce: Web
Search Changes, Computer, March 2002.

9
Why Are Queries so Short?

Users minimize effort.
Users don't realize more information is better.
Users learn that more words lead to fewer
results (because of the implicit AND).
Query boxes are small.
Belkin, N.J., et al. Rutgers' TREC 2001
Interactive Track Experience, in Voorhees and
Harman, The Tenth Text Retrieval Conference.

10
Different Kind of Queries
11
Distribution of Query Types
12
Hourly Analysis of a Query Log

Steven M. Beitzel, Eric C. Jensen, Abdur
Chowdhury, David Grossman, Ophir Frieder, "Hourly
Analysis of a Very Large Topically Categorized
Web Query Log", Proceedings of the 2004 ACM
Conference on Research and Development in
Information Retrieval (ACM-SIGIR), Sheffield, UK,
July 2004.

13
Frequency Time Distribution
14
Query Repetition
15
Query Categories
16
Categories over Time
17
Analysis of Three Query Logs

Tiziano Fagni, Salvatore Orlando, Raffaele
Perego, Fabrizio Silvestri. Boosting the
Performance of Web Search Engines: Caching and
Prefetching Query Results by Exploiting
Historical Usage Data. ACM Transactions on
Information Systems. 24(1). January 2006.

18
Temporal Locality
?0.66
19
Query Submission Distance
20
Page Requested
21
Subsequent Page Requests
22
Query Caching
[Figure: query caching example with the labels Index, WSE, and Results, and the request (Francesca, 1).]
23
Caching: Who Cares?!?

Successful caching of query results can:
Lower the number/cost of query executions.
Shorten the engine's response time.
Increase the engine's throughput.

24
Caching: How-To?

Caching can exploit locality of reference in the
query streams search engines are faced with.
Query popularity follows a power law and varies
widely, from the extremely popular to the very
rare.

25
Caching: What to Measure?

Hit Ratio
Let N be the number of requests to the WSE.
Let H be the number of hits, i.e. the number of
queries that can be answered by the cache.
The Hit Ratio HR is defined as HR = H/N. It is
usually expressed as a percentage.
E.g. HR = 30% means that thirty percent of
the queries are satisfied using the cache.
Alternatively, we could define the Miss Ratio MR
= 1 - HR = M/N, where M is the number of misses,
i.e. the number of queries that cannot be
answered by the cache.
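
To make the two definitions concrete, here is a small sketch that replays a query stream and returns HR and MR. It assumes an unbounded cache, so every repeated query counts as a hit; the function name and the sample stream are made up for illustration.

```python
def hit_and_miss_ratio(stream):
    """Replay a query stream against an unbounded cache; return (HR, MR)."""
    cache = set()
    requests = 0  # N
    hits = 0      # H
    for query in stream:
        requests += 1
        if query in cache:
            hits += 1
        else:
            cache.add(query)  # a miss: the result is computed and stored
    if requests == 0:
        raise ValueError("empty query stream")
    misses = requests - hits  # M
    return hits / requests, misses / requests


hr, mr = hit_and_miss_ratio(["pisa", "web mining", "pisa", "web mining"])
```

With a bounded cache the replacement policy decides which entries survive, so the measured HR can only be equal to or lower than the value obtained here.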

26
What About the Throughput?

The throughput is defined as the number of
queries answered per second.
Caching, in general, raises the throughput.
The lower the hit-ratio the lower the throughput.
The lower the cache response-time the higher the
throughput.
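
The two relationships above can be checked with a back-of-the-envelope cost model. This model is an assumption, not something taken from the slides: a single server handles queries one at a time, every query pays the cache lookup time, and each miss additionally pays the backend time.

```python
def throughput(hit_ratio, cache_time, backend_time):
    """Queries per second for one sequential server (assumed cost model).

    Every query pays cache_time; a miss also pays backend_time.
    Both times are in seconds.
    """
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    mean_time = cache_time + (1.0 - hit_ratio) * backend_time
    return 1.0 / mean_time


low_hr = throughput(hit_ratio=0.3, cache_time=0.001, backend_time=0.1)
high_hr = throughput(hit_ratio=0.5, cache_time=0.001, backend_time=0.1)
fast_cache = throughput(hit_ratio=0.3, cache_time=0.0005, backend_time=0.1)
```

With these made-up timings, raising the hit ratio from 0.3 to 0.5 moves the throughput from roughly 14 to roughly 20 queries per second, and halving the cache lookup time gives a further small gain.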

27
Caching Complexity

The caching response time depends on the
replacement policy complexity.
The complexity usually depends on the cache size
K.
There exist policies that are:
O(1), i.e. constant. They don't depend on the
size of the cache.
O(log K).
O(N).

28
Is There Only Caching?

No!!!!
There's also PREFETCHING!
What's prefetching?
Anticipating users' queries by exploiting query
stream properties.
Uhuuuu! Sounds like a kind of Usage Mining!
For instance, let's have a look at the probability
of subsequent page requests.
The prefetching factor p is the number of pages
prefetched.
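
A minimal sketch of the idea, under stated assumptions: on a miss for page i of a query, the cache asks the backend for page i plus the next p pages and stores all of them. The class name, the backend signature, and the fake backend are hypothetical, and the store is unbounded (no replacement policy) to keep the example short.

```python
class PrefetchingCache:
    """Result cache that prefetches p extra pages on every miss."""

    def __init__(self, backend, p):
        # backend(query, first_page, n_pages) must return a list of n_pages pages.
        self.backend = backend
        self.p = p
        self.store = {}  # (query, page) -> results; unbounded in this sketch
        self.backend_calls = 0

    def get(self, query, page):
        key = (query, page)
        if key not in self.store:
            self.backend_calls += 1
            pages = self.backend(query, page, 1 + self.p)
            for offset, results in enumerate(pages):
                self.store[(query, page + offset)] = results
        return self.store[key]


def fake_backend(query, first_page, n_pages):
    """Stand-in for the query processor; returns placeholder result pages."""
    return [f"{query}: results page {first_page + i}" for i in range(n_pages)]


cache = PrefetchingCache(fake_backend, p=2)
cache.get("web mining", 1)  # miss: pages 1, 2 and 3 are computed together
cache.get("web mining", 2)  # hit, thanks to prefetching
cache.get("web mining", 3)  # hit, thanks to prefetching
```

After these three requests the backend has been called once. If the user had stopped at page 1, pages 2 and 3 would have occupied cache space for nothing.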

29
Prefetching: PROS and CONS

Prefetching enhances the hit ratio.
Prefetching reduces the query load on the query
server.
The cost of computing p pages of results is
approximately the same as computing only one page.
Prefetching is very likely to load pages that
will never be requested in the future.

30
Adaptive Prefetching
31
Theoretical Bounds
32
Some Classical Caching Policies

LRU
Least Recently Used.
Evict from the cache the query results that were
accessed farthest in the past.
SLRU
Two segments:
Probationary.
Protected.
Lines in each segment are ordered from the most
to the least recently accessed. Data from misses
is added to the cache at the most recently
accessed end of the probationary segment. Hits
are removed from wherever they currently reside
and added to the most recently accessed end of
the protected segment. Lines in the protected
segment have thus been accessed at least twice.
The protected segment is finite, so migration of
a line from the probationary segment to the
protected segment may force the migration of the
LRU line in the protected segment to the most
recently used (MRU) end of the probationary
segment, giving this line another chance to be
accessed before being replaced.
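
The SLRU description above can be sketched with two ordered dictionaries, each kept in LRU-to-MRU order. This is an illustrative sketch, not the implementation used in the cited experiments; the class and method names are hypothetical.

```python
from collections import OrderedDict


class SLRUCache:
    """Segmented LRU: a probationary and a protected segment."""

    def __init__(self, probationary_size, protected_size):
        self.probationary_size = probationary_size
        self.protected_size = protected_size
        # In both segments the last item is the most recently used (MRU).
        self.probationary = OrderedDict()
        self.protected = OrderedDict()

    def get(self, key):
        """Return the cached value, or None on a miss."""
        if key in self.protected:
            self.protected.move_to_end(key)
            return self.protected[key]
        if key in self.probationary:
            value = self.probationary.pop(key)
            self._promote(key, value)  # second access: move to protected
            return value
        return None

    def put(self, key, value):
        """Store the result of a miss at the probationary MRU end."""
        if key in self.protected:
            self.protected[key] = value
            self.protected.move_to_end(key)
            return
        self.probationary[key] = value
        self.probationary.move_to_end(key)
        self._trim_probationary()

    def _promote(self, key, value):
        self.protected[key] = value
        if len(self.protected) > self.protected_size:
            # Demote the protected LRU line to the probationary MRU end.
            old_key, old_value = self.protected.popitem(last=False)
            self.probationary[old_key] = old_value
        self._trim_probationary()

    def _trim_probationary(self):
        while len(self.probationary) > self.probationary_size:
            self.probationary.popitem(last=False)  # evict from the cache


cache = SLRUCache(probationary_size=2, protected_size=1)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # "a" moves to the protected segment
cache.put("c", 3)
cache.get("b")     # "b" is promoted, "a" is demoted to probationary
cache.put("d", 4)  # probationary overflows, "c" is evicted
```

Plain LRU is the special case with a single segment: every hit moves the line to the MRU end and evictions always take the LRU end.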

33
Problems

Classical Replacement Policies do not care about
stream characteristics.
They are not designed using usage mining
investigation techniques.
They offer good performance, though!
Uhmmm. Are you sure?!? Stay tuned!

34
Caching May be Attacked from two Directions

Architecture of the caching system: two-level caching, three-level caching, SDC.
Replacement policy: PDC, SDC.
Both: SDC.

35
Two-level Caching

Cache of Query Results
Cache of Inverted Lists
Both

36
Throughput
37
Three-level Caching

Long, X. and Suel, T. 2005. Three-level caching
for efficient query processing in large Web
search engines. In Proceedings of the 14th
international Conference on World Wide Web
(Chiba, Japan, May 10 - 14, 2005). WWW '05. ACM
Press, New York, NY, 257-266.

38
Probability Driven Caching

Lempel, R. and Moran, S. 2003. Predictive caching
and prefetching of query results in search
engines. In Proceedings of the 12th international
Conference on World Wide Web (Budapest, Hungary,
May 20 - 24, 2003). WWW '03. ACM Press, New York,
NY, 19-28.
Thanks to Ronny for his original slides!

39
Static-Dynamic Caching

Tiziano Fagni, Salvatore Orlando, Raffaele
Perego, Fabrizio Silvestri. Boosting the
Performance of Web Search Engines: Caching and
Prefetching Query Results by Exploiting
Historical Usage Data. ACM Transactions on
Information Systems. 24(1). January 2006.
Idea:
Divide the cache into two sets:
Static Set.
Dynamic Set.
Fill the Static Set using the queries most
frequently submitted in the past.
The Static Set is read-only, which is good in
multithreaded architectures.

40
Inside SDC

Static-Dynamic Caching.
The cache is divided into two sets:
Static Set: contains the results of the queries
most frequently submitted so far.
Dynamic Set: implemented using a classical
caching replacement policy like, for instance,
LRU, SLRU, PDC.
The Static Set size is given by fstatic · N, where
0 < fstatic < 1 is the fraction of the total
entries (N) of the cache devoted to the Static
Set.
Adaptive Prefetching is adopted.
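
A minimal sketch of the static/dynamic split, under stated assumptions: the Static Set is filled once from a training log, its results are computed at construction time, the Dynamic Set uses plain LRU, and adaptive prefetching is left out. The class name SDCCache, the compute callback, and the training data are hypothetical and not taken from the cited paper.

```python
from collections import Counter, OrderedDict


class SDCCache:
    """Static-Dynamic Cache sketch: read-only static set plus an LRU dynamic set."""

    def __init__(self, training_queries, compute, n_entries, f_static):
        if not 0 < f_static < 1:
            raise ValueError("f_static must lie strictly between 0 and 1")
        static_size = int(f_static * n_entries)  # f_static * N entries
        self.dynamic_size = n_entries - static_size
        self.compute = compute  # compute(query) -> results (hypothetical backend)
        top = Counter(training_queries).most_common(static_size)
        # Never modified after construction, so readers need no lock.
        self.static = {query: compute(query) for query, _ in top}
        self.dynamic = OrderedDict()  # last item = most recently used
        self.requests = 0
        self.hits = 0

    def get(self, query):
        self.requests += 1
        if query in self.static:
            self.hits += 1
            return self.static[query]
        if query in self.dynamic:
            self.hits += 1
            self.dynamic.move_to_end(query)
            return self.dynamic[query]
        results = self.compute(query)  # miss: ask the backend
        self.dynamic[query] = results
        if len(self.dynamic) > self.dynamic_size:
            self.dynamic.popitem(last=False)  # evict the LRU entry
        return results


# Made-up training log and test stream.
training = ["pisa"] * 5 + ["web mining"] * 3 + ["lru"]
cache = SDCCache(training, compute=lambda q: f"results for {q}",
                 n_entries=4, f_static=0.5)
for q in ["pisa", "caching", "pisa", "caching", "web mining", "x"]:
    cache.get(q)
```

In this run the Static Set holds "pisa" and "web mining", and four of the six requests are hits. In a multithreaded server only the Dynamic Set would need synchronization, since the Static Set is never written after it is built.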

41
Benefits in Real-World Caches
[Figure: one SDC cache, four SDC cache threads, and the WSE; labels: Static Set, Dynamic Set, Mutex.]
42
SDC Performance

Linux PC 2GHz Pentium Xeon - 1GB RAM
Single process.
fstatic = 0.5. No prefetching.

43
SDC Hit-Ratio
44
SDC Hit-Ratio
45
SDC Hit-Ratio
46
SDC Hit-Ratio
47
SDC Hit-Ratio
48
SDC Hit-Ratio
49
SDC Hit-Ratio
50
SDC Hit-Ratio
51
Why Does the Static Set Help?
52
Concurrent Caching
53
Freshness of the Training Data

How frequently should we perform mining again on
the usage data?
Does the performance of Usage-Mining-based caching
degrade gracefully as time goes by?
Do time-of-day patterns exist in the query stream?

54
Daily Patterns
