Title: Database Searching and Information Retrieval
1Database Searching and Information Retrieval
- Presented by
- Tushar Kumar.J
- Ritesh Bagga
2Background
- Motivation
- Our main motivation for choosing this topic was our interest in expanding our knowledge of databases, and the support it will provide to our research work.
- Focus
- Our focus is on the various algorithms employed to retrieve the top few results from a database. This has recently been one of the most exciting fields in database research.
3Introduction to Problem
- Most often we query a single database.
- At times we need to query multiple databases with heterogeneous data.
- It is difficult for the user to write a single SQL query that works on all the databases.
- Solution: develop a middleware system that works on top of these subsystems.
- This middleware divides the query into subqueries and runs them on each individual subsystem.
4Introduction to Problem
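[Figure: the user query (Color = Red) AND (Shape = Circle) is sent to the middleware system (we will study algorithms which run on this middleware). The middleware sends the subquery Color = Red to one subsystem and the subquery Shape = Circle to another. Each subsystem returns a graded list (Redness, Circle) of the objects R1 to R4, and the aggregation function (MIN) combines the two grades of each object into the result.]
Grades shown in the two lists: R1 (1.00 and 0.10), R2 (0.50 and 0.00), R3 (0.70 and 1.00), R4 (0.40 and 0.00)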
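The aggregation step on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the middleware itself. The grades are the ones shown on the slide, but which grade belongs to the Redness list and which to the Circle list is an assumption; because MIN is symmetric, the overall grades do not depend on that choice. The function name `aggregate_min` is hypothetical.

```python
# Grades of the four objects in the two lists. The split of each object's
# two grades between "redness" and "circle" is an assumption.
redness = {"R1": 1.00, "R2": 0.50, "R3": 0.70, "R4": 0.40}
circle = {"R1": 0.10, "R2": 0.00, "R3": 1.00, "R4": 0.00}


def aggregate_min(*graded_lists):
    """Overall grade of an object = minimum of its grades in all lists."""
    objects = set.intersection(*(set(grades) for grades in graded_lists))
    return {obj: min(grades[obj] for grades in graded_lists) for obj in objects}


overall = aggregate_min(redness, circle)
# Best overall grade first; ties are broken by object name.
ranking = sorted(overall.items(), key=lambda item: (-item[1], item[0]))
print(ranking)
```

Under MIN, R3 comes out on top because it is the only object with a high grade in both lists.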
5Framework of this presentation
- Basic algorithms
- Comparative study of basic algorithms
- Modifications of TA algorithm
- Advanced algorithms
- Related work
- How web-search engines rank the web pages?
- Conclusion
6Basic algorithms Fagin's Algorithm
- The original algorithm for solving this problem was developed by Ron Fagin and is called Fagin's Algorithm (FA).
- FA consists of the following steps:
- Perform sorted access in parallel to each of the m lists.
- For every new object R seen, perform random access to every other list to find the i-th field x_i of R.
- Use the aggregation function t(R) = t(x_1, x_2, ..., x_m) to calculate the overall grade of every object, and store it in set Y.
- Define set H as the set of objects seen in all the lists.
- Stopping point: set H has at least k objects.
- Sort set Y and output the top k values.
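The steps of FA can be sketched as follows. This is a minimal in-memory illustration under stated assumptions, not a reference implementation: every list grades the same objects, random access is simulated with a dictionary lookup, the aggregation function defaults to a sum, and the function name `fagin_top_k` and the three small lists are hypothetical.

```python
def fagin_top_k(lists, k, t=sum):
    """Return the k best (object, overall grade) pairs using FA.

    lists: m lists of (object, grade) pairs, each sorted by grade
    (highest first) and each grading the same set of objects.
    t: a monotone aggregation function over an iterable of grades.
    """
    m = len(lists)
    random_access = [dict(lst) for lst in lists]  # grade lookup per list
    seen_in = {}  # object -> indices of the lists where sorted access saw it
    Y = {}        # object -> overall grade
    for depth in range(len(lists[0])):
        # One round of sorted access "in parallel" to the m lists.
        for i, lst in enumerate(lists):
            obj, _ = lst[depth]
            seen_in.setdefault(obj, set()).add(i)
            if obj not in Y:
                # Random access to find the fields x_1 .. x_m of the object.
                Y[obj] = t(random_access[j][obj] for j in range(m))
        H = [obj for obj, where in seen_in.items() if len(where) == m]
        if len(H) >= k:  # stopping point: k objects seen in all lists
            break
    return sorted(Y.items(), key=lambda pair: (-pair[1], pair[0]))[:k]


# Hypothetical data: three sorted lists grading the objects a to e.
lists = [
    [("a", 0.9), ("b", 0.8), ("c", 0.7), ("d", 0.6), ("e", 0.1)],
    [("b", 0.9), ("c", 0.8), ("a", 0.5), ("e", 0.4), ("d", 0.3)],
    [("c", 1.0), ("b", 0.7), ("d", 0.6), ("a", 0.3), ("e", 0.2)],
]
fa_result = fagin_top_k(lists, k=2)
print([obj for obj, _ in fa_result])
```

On this data FA needs three rounds of sorted access before two objects (b and c) have been seen in all three lists.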
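7Basic algorithms Fagin's Algorithm
[Figure: worked example of FA on four sorted lists of ten objects, R1 to R10. Each row below holds the entries found at the same depth of the four lists; the order within a row does not identify the list.]
Row 1: R8(0.95) R3(0.95) R5(1.00) R10(1.00)
Row 2: R7(0.95) R10(0.80) R2(0.90) R3(0.95)
Row 3: R4(0.70) R8(0.90) R5(0.85) R7(0.85)
Row 4: R2(0.85) R8(0.65) R3(0.80) R8(0.80)
Row 5: R4(0.80) R5(0.75) R7(0.60) R7(0.75)
Row 6: R3(0.70) R2(0.55) R9(0.70) R2(0.75)
Row 7: R4(0.65) R1(0.65) R9(0.50) R6(0.60)
Row 8: R1(0.60) R5(0.45) R9(0.55) R1(0.50)
Row 9: R4(0.40) R6(0.45) R10(0.55) R6(0.40)
Row 10: R1(0.30) R6(0.50) R9(0.30) R10(0.30)
Objects Seen (overall grade): R2 3.05, R3 3.40, R4 2.55, R5 3.05, R7 3.15, R8 3.30, R9 2.05, R10 2.65
Objects seen in all 4 lists: R8, R7, R2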
8Basic algorithms Threshold Algorithm
- Similar to FA, with a slight modification.
- TA consists of the following steps:
- Perform sorted access in parallel to each of the m lists.
- For every new object R seen, perform random access to every other list to find the i-th field x_i of R.
- Use the aggregation function t(R) = t(x_1, x_2, ..., x_m) to calculate the overall grade of every object, and store it in set Y only if it is among the top k objects seen so far.
- After every round of sorted access, calculate the threshold value T: the aggregation function applied to the last grade seen under sorted access in each list.
- Stopping point: as soon as at least k objects have been seen whose grade is at least equal to T.
- Return set Y, which holds the top k values.
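A minimal sketch of these steps, under the same kind of assumptions as a classroom example: all lists grade the same objects, random access is a dictionary lookup, the aggregation function defaults to a sum, and the name `threshold_top_k` and the data are hypothetical. The buffer Y is kept as a heap of at most k entries, which reflects the bounded buffer that TA needs.

```python
import heapq


def threshold_top_k(lists, k, t=sum):
    """Return (top k as (grade, object) pairs, rounds of sorted access)."""
    m = len(lists)
    random_access = [dict(lst) for lst in lists]
    Y = []        # min-heap of (overall grade, object), never more than k
    seen = set()
    rounds = 0
    for depth in range(len(lists[0])):
        rounds = depth + 1
        last_grades = []  # last grade seen under sorted access in each list
        for lst in lists:
            obj, grade = lst[depth]
            last_grades.append(grade)
            if obj in seen:
                continue
            seen.add(obj)
            overall = t(random_access[j][obj] for j in range(m))
            if len(Y) < k:
                heapq.heappush(Y, (overall, obj))
            elif overall > Y[0][0]:
                heapq.heapreplace(Y, (overall, obj))  # evict the current k-th
        T = t(last_grades)  # threshold value
        if len(Y) == k and Y[0][0] >= T:  # k objects with grade >= T
            break
    return sorted(Y, key=lambda pair: (-pair[0], pair[1])), rounds


# Hypothetical data: three sorted lists grading the objects a to e.
lists = [
    [("a", 0.9), ("b", 0.8), ("c", 0.7), ("d", 0.6), ("e", 0.1)],
    [("b", 0.9), ("c", 0.8), ("a", 0.5), ("e", 0.4), ("d", 0.3)],
    [("c", 1.0), ("b", 0.7), ("d", 0.6), ("a", 0.3), ("e", 0.2)],
]
ta_result, ta_rounds = threshold_top_k(lists, k=2)
print([obj for _, obj in ta_result], ta_rounds)
```

On this data the threshold drops from 2.8 after the first round to 2.3 after the second, at which point both buffered objects have a grade of at least T and TA stops.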
9Basic algorithms Threshold Algorithm
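[Figure: worked example of TA on four sorted lists of ten objects, R1 to R10. Each row below holds the entries found at the same depth of the four lists; the order within a row does not identify the list. The threshold is shown for the first four rows.]
Row 1: R8(0.95) R3(0.95) R5(1.00) R10(1.00), threshold 3.90/4
Row 2: R7(0.95) R10(0.80) R2(0.90) R3(0.95), threshold 3.60/4
Row 3: R4(0.70) R8(0.90) R5(0.85) R7(0.85), threshold 3.30/4
Row 4: R2(0.85) R8(0.65) R3(0.80) R8(0.80), threshold 3.10/4
Row 5: R4(0.80) R5(0.75) R7(0.60) R7(0.75)
Row 6: R3(0.70) R2(0.55) R9(0.70) R2(0.65)
Row 7: R4(0.65) R1(0.65) R9(0.50) R6(0.60)
Row 8: R1(0.60) R5(0.45) R9(0.55) R1(0.50)
Row 9: R4(0.40) R6(0.45) R10(0.55) R6(0.40)
Row 10: R1(0.30) R6(0.50) R9(0.30) R10(0.30)
Top 3 Objects: R3(3.40/4), R8(3.30/4), R7(3.15/4), R5(3.05/4), R2(2.95/4), R10(2.65/4)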
10Basic algorithms Comparison between TA and FA
- FA is optimal only in some cases, whereas TA is instance optimal, that is, optimal up to a constant factor on every database.
- TA uses less buffer space; FA requires a buffer that grows with the database size.
- TA may do m-1 random accesses for every object not in the top k set, but FA does this random access only once for every object newly seen under sorted access.
11Modifications of TA Algorithms
- Approximation: an algorithm to find the top k elements with x degree of approximation. It stops earlier than TA.
- Restricting sorted access: used when sorted access to some lists is not allowed, e.g. finding the best restaurant.
- Restricting random access:
- NRA (No Random Access) was developed for cases where no random access is allowed, e.g. a text retrieval system.
- CA (Combined Algorithm) was developed for situations where random accesses are allowed but very costly, e.g. random disk access. It is a combination of TA and NRA.
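To give a feel for how an algorithm can work with no random access at all, here is a rough sketch of the NRA idea: keep a lower and an upper bound on every object's overall grade and stop once no other object can still overtake the current top k. It assumes grades between 0 and 1, a sum as the aggregation function, and lists that grade the same objects; the name `nra_top_k` and the data are hypothetical, and the bookkeeping is simplified compared with the published algorithm.

```python
def nra_top_k(lists, k, t=sum):
    """Return (top k objects, rounds of sorted access) without random access."""
    m = len(lists)
    known = {}   # object -> {list index: grade seen under sorted access}
    ranked = []
    rounds = 0
    for depth in range(len(lists[0])):
        rounds = depth + 1
        bottoms = [lst[depth][1] for lst in lists]  # last grade seen per list
        for i, lst in enumerate(lists):
            obj, grade = lst[depth]
            known.setdefault(obj, {})[i] = grade

        def worst(obj):  # lower bound: unknown fields count as 0
            return t(known[obj].get(i, 0.0) for i in range(m))

        def best(obj):   # upper bound: unknown fields are at most bottoms[i]
            return t(known[obj].get(i, bottoms[i]) for i in range(m))

        ranked = sorted(known, key=lambda obj: (-worst(obj), obj))
        top, rest = ranked[:k], ranked[k:]
        if len(top) == k:
            cutoff = worst(top[-1])
            # Stop when neither an unseen object (bounded by t(bottoms)) nor
            # a seen object outside the top k can exceed the k-th lower bound.
            if t(bottoms) <= cutoff and all(best(o) <= cutoff for o in rest):
                break
    return ranked[:k], rounds


# Hypothetical data: three sorted lists grading the objects a to e.
lists = [
    [("a", 0.9), ("b", 0.8), ("c", 0.7), ("d", 0.6), ("e", 0.1)],
    [("b", 0.9), ("c", 0.8), ("a", 0.5), ("e", 0.4), ("d", 0.3)],
    [("c", 1.0), ("b", 0.7), ("d", 0.6), ("a", 0.3), ("e", 0.2)],
]
nra_objects, nra_rounds = nra_top_k(lists, k=2)
print(nra_objects, nra_rounds)
```

Note that this sketch returns only the top k objects; without random access their exact overall grades may not be known when the algorithm stops.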
12Advanced algorithms
- Suppose we already have several ranked lists of objects; the problem here is to aggregate these lists to form a single ranked list.
- The problem can be solved using a median-finding algorithm.
- The steps involved in the median-finding algorithm are:
- - Find the rank of each object in each of the ranked lists.
- - For each object, find the median of the ranks obtained from these lists.
- - Sort the objects by their median ranks.
- - Retrieve the results from this sorted list.
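The four steps above can be sketched directly in Python. This is an illustration that assumes every ranked list contains the same objects; the function name `median_rank_aggregate` and the three example rankings are hypothetical.

```python
from statistics import median


def median_rank_aggregate(rankings):
    """Aggregate ranked lists by sorting objects on their median rank."""
    # Step 1: rank of each object in each list (1 = best).
    positions = [
        {obj: rank for rank, obj in enumerate(ranking, start=1)}
        for ranking in rankings
    ]
    # Step 2: median of the ranks of each object.
    median_rank = {
        obj: median(pos[obj] for pos in positions) for obj in rankings[0]
    }
    # Steps 3 and 4: sort by median rank (ties broken by name) and return.
    ordered = sorted(median_rank, key=lambda obj: (median_rank[obj], obj))
    return ordered, median_rank


# Hypothetical data: three rankings of the objects a to d.
rankings = [
    ["a", "b", "c", "d"],
    ["b", "a", "d", "c"],
    ["b", "c", "a", "d"],
]
aggregated, median_rank = median_rank_aggregate(rankings)
print(aggregated)
```

Looking up every object's rank in every list is what makes this approach expensive on large lists.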
13Advanced algorithms
- A limitation of the median-finding algorithm is its large number of random accesses, which the MEDRANK algorithm overcomes.
- The MEDRANK algorithm accesses the ranked lists one element of every list at a time, until some element has been seen in more than half of the lists.
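A minimal sketch of this access pattern, assuming every ranked list contains the same objects. The function name `medrank` is hypothetical, and the parameter `k` (continue until k elements have been seen in more than half of the lists) is an assumption added to return more than one winner.

```python
def medrank(rankings, k=1):
    """Return the first k objects seen in more than half of the lists."""
    m = len(rankings)
    times_seen = {}
    winners = []
    for depth in range(len(rankings[0])):
        # Read one element of every list at the current depth.
        for ranking in rankings:
            obj = ranking[depth]
            times_seen[obj] = times_seen.get(obj, 0) + 1
            if times_seen[obj] > m / 2 and obj not in winners:
                winners.append(obj)
                if len(winners) == k:
                    return winners
    return winners


# Hypothetical data: three rankings of the objects a to d.
rankings = [
    ["a", "b", "c", "d"],
    ["b", "a", "d", "c"],
    ["b", "c", "a", "d"],
]
best_one = medrank(rankings)
best_two = medrank(rankings, k=2)
print(best_one, best_two)
```

Only sorted access is used here: the lists are read from the top, and the scan stops as soon as enough winners are found.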
14Related work
- In 1996, Chaudhuri and Gravano presented an algorithm built on Fagin's original FA algorithm.
- In 1997 and 1998, Carey and Kossmann presented techniques to optimize top-k queries.
- In 1999, Nepal and Ramakrishna presented variations on Fagin's TA algorithm for processing queries over multimedia databases.
- In 2000, Güntzer et al. made a remarkable contribution to Fagin's TA algorithm by reducing the number of random accesses.
- In 2002, Chang and Hwang presented an algorithm called MPro to optimize the execution of expensive predicates.
15How web-search engines rank the web pages (1)
- Web-search engines rank web pages based on various factors.
- Among the factors most commonly used by web-search engines, the frequency of occurrence and the location of the search terms are the primary ones.
- Two of the most important web-search engines:
- Google and AltaVista
16How web-search engines rank the web pages (2)
- AltaVista
- - Maintains a huge phrase dictionary.
- - The basic intuition behind its ranking of web pages is as follows:
- It first displays all the pages containing the phrase.
- Then it displays all the pages in which the words are closer to each other.
- These are followed by all pages containing all the terms, and then by pages containing any of the terms.
- - Another important factor is the popularity of the search being performed.
17How web-search engines rank the web pages (3)
- Google
- - Uses a very different technology, called PageRank.
- PageRank technology
- - Measures the importance of a web page by solving an equation.
- - Interprets a link as a vote.
- - Assesses a page's importance by the number of votes it receives.
- - Important pages receive a higher rank and appear at the top of the search results.
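The "link as a vote" idea can be illustrated with a small power-iteration sketch of the PageRank equation. This is a textbook-style approximation, not Google's production ranking: the damping factor of 0.85, the number of iterations, the function name `pagerank`, and the four-page link graph are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Approximate PageRank by repeatedly passing votes along links.

    links: page -> list of pages it links to (every target is also a key).
    """
    pages = sorted(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            # A page with no outgoing links spreads its vote over all pages.
            targets = links[page] or pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share  # each link acts as a vote
        rank = new_rank
    return rank


# Hypothetical link graph: C is linked to by A, B and D.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
top_page = max(ranks, key=ranks.get)
print(top_page)
```

Page C receives the most votes and ends up with the highest rank, while D, which no page links to, ends up with the lowest.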
18Conclusion
- The literature we studied shows that much work has been done on the problem of retrieving top-k results from a database.
- We came across many algorithms that are very tricky to understand.
- Research in this field is still very active.
- The focus is now on devising more sophisticated algorithms for aggregating ranked lists.
19References
- [1] Ronald Fagin. Combining Fuzzy Information from Multiple Systems. Received July 4, 1996; revised June 22, 1998.
- [2] Ronald Fagin. Combining Fuzzy Information: an Overview. ACM SIGMOD Record 31, 2, June 2002, pp. 109-118.
- [3] Ronald Fagin, Amnon Lotem, and Moni Naor. Optimal Aggregation Algorithms for Middleware. J. Computer and System Sciences 66 (2003), pp. 614-656. Extended abstract in Proc. 2001 ACM Symposium on Principles of Database Systems (PODS '01), pp. 102-113.
- [4] Ronald Fagin, Ravi Kumar, and D. Sivakumar. Efficient Similarity Search and Classification via Rank Aggregation. Proc. 2003 ACM SIGMOD Conference (SIGMOD '03), pp. 301-312.
- [5] Ronald Fagin, Ravi Kumar, Mohammad Mahdian, D. Sivakumar, and Erik Vee. Comparing and Aggregating Rankings with Ties. Proc. 2004 ACM Symposium on Principles of Database Systems (PODS '04), pp. 47-58.
- [6] Ronald Fagin, Ravi Kumar, and D. Sivakumar. Comparing Top k Lists. SIAM J. Discrete Mathematics 17, 1 (2003), pp. 134-160. Extended abstract in 2003 ACM-SIAM Symposium on Discrete Algorithms (SODA '03), pp. 28-36.
- [7] A. Marian, N. Bruno, and L. Gravano. Evaluating Top-k Queries over Web-Accessible Databases. Accepted for publication in ACM Transactions on Database Systems, 2003.
- [8] Martin P. Courtois and Michael W. Berry. Results Ranking in Web Search Engines. Online, May 1999.
20Thank you!
- Any Questions?