Database Searching and Information Retrieval - PowerPoint PPT Presentation

Learn more at: http://ranger.uta.edu

1
Database Searching and Information Retrieval

Presented by
Tushar Kumar.J
Ritesh Bagga

2
Background

Motivation
We chose this topic because we wanted to deepen our knowledge of databases
and because it supports our research work.
Focus
Our focus is on the algorithms used to retrieve the top few (top-k) results
from a database. This has recently been one of the most active areas in
database research.

3
Introduction to Problem

Most often we query a single database.
At times we need to query multiple databases with heterogeneous data.
It is difficult for a user to write a single SQL query that works on all of
the databases.
Solution: develop a middleware system that works on top of these subsystems.
The middleware divides the query into sub-queries and runs them on each
individual subsystem.

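4
Introduction to Problem

[Figure: the user query (Color = Red) AND (Shape = Circle) is sent to the
middleware system (we will study algorithms which run on this middleware).
The middleware splits it into the sub-queries Color = Red and Shape = Circle,
and the subsystems return two graded lists, Redness and Circle. The two
grades shown for each object are R1: 1.00 and 0.10; R2: 0.50 and 0.00;
R3: 0.70 and 1.00; R4: 0.40 and 0.00. The aggregation function (MIN)
combines the two grades of each object to produce the result.]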
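
The aggregation step in the figure can be sketched as follows. This is a minimal illustration, not the middleware itself: the function name `aggregate_min` is hypothetical, and which of an object's two grades belongs to the Redness list and which to the Circle list is an assumption (the MIN result is the same either way).

```python
# Grades per object; the split between the two lists is an assumption
# made for illustration only.
redness = {"R1": 1.00, "R2": 0.50, "R3": 0.70, "R4": 0.40}
circle = {"R1": 0.10, "R2": 0.00, "R3": 1.00, "R4": 0.00}


def aggregate_min(*graded_lists):
    """Overall grade of each object = minimum of its grades in all lists."""
    objects = set.intersection(*(set(g) for g in graded_lists))
    return {obj: min(g[obj] for g in graded_lists) for obj in objects}


overall = aggregate_min(redness, circle)
# Rank by overall grade, highest first; ties are broken by object name.
ranking = sorted(overall.items(), key=lambda kv: (-kv[1], kv[0]))
print(ranking)
```

Under MIN, R3 comes first because it is the only object with a high grade on both conditions.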

5
Framework of this presentation

Basic algorithms
Comparative study of basic algorithms
Modifications of the TA algorithm
Advanced algorithms
Related work
How web-search engines rank the web pages?
Conclusion

6
Basic algorithms: Fagin's Algorithm

The original algorithm for this problem was developed by Ron Fagin and is
known as Fagin's Algorithm (FA).
FA consists of the following steps:
Do sorted access in parallel to each of the m lists.
For every new object R seen, do random access to every other list to find
the i-th field x_i of R.
Use the aggregation function t(R) = t(x_1, x_2, ..., x_m) to compute the
overall grade of every object seen, and store it in set Y.
Define set H as the set of objects that have been seen in all of the lists.
Stopping point: set H has at least k objects.
Sort set Y and output the top k objects.
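
The steps above can be sketched in Python as follows. This is a sketch under stated assumptions, not a reference implementation: each list is assumed to be a Python list of (object, grade) pairs sorted by grade in descending order, every object is assumed to appear in every list, dictionary lookups stand in for random access, and the name `fagin_top_k` and the sample grades are hypothetical. For simplicity the random accesses are done after the sorted-access phase, which gives the same answer.

```python
def fagin_top_k(lists, k, agg=min):
    """Sketch of FA: return the k objects with the highest overall grade."""
    m = len(lists)
    lookup = [dict(lst) for lst in lists]  # stands in for random access
    seen_in = {}  # object -> indices of the lists where it was seen

    # Sorted access in parallel, one entry per list at each depth.
    for depth in range(len(lists[0])):
        for i, lst in enumerate(lists):
            obj, _grade = lst[depth]
            seen_in.setdefault(obj, set()).add(i)
        # Set H: objects seen in all m lists. Stop once |H| >= k.
        if sum(1 for s in seen_in.values() if len(s) == m) >= k:
            break

    # Random access: fetch every field of every seen object, then aggregate.
    grades = {obj: agg(lookup[i][obj] for i in range(m)) for obj in seen_in}
    # Set Y sorted by overall grade (ties broken by name); output the top k.
    return sorted(grades.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Hypothetical sample data: two lists, each sorted by grade descending.
redness = [("R1", 1.00), ("R3", 0.70), ("R2", 0.50), ("R4", 0.40)]
circle = [("R3", 1.00), ("R1", 0.10), ("R2", 0.00), ("R4", 0.00)]
fa_result = fagin_top_k([redness, circle], k=1)
print(fa_result)
```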

7
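Basic algorithms: Fagin's Algorithm

[Figure: worked example of FA on four sorted lists over the objects R1 to
R10. Entries read at each depth of sorted access, one from each list:
Depth 1: R8 (0.95), R3 (0.95), R5 (1.00), R10 (1.00)
Depth 2: R7 (0.95), R10 (0.80), R2 (0.90), R3 (0.95)
Depth 3: R4 (0.70), R8 (0.90), R5 (0.85), R7 (0.85)
Depth 4: R2 (0.85), R8 (0.65), R3 (0.80), R8 (0.80)
Depth 5: R4 (0.80), R5 (0.75), R7 (0.60), R7 (0.75)
Depth 6: R3 (0.70), R2 (0.55), R9 (0.70), R2 (0.75)
Depth 7: R4 (0.65), R1 (0.65), R9 (0.50), R6 (0.60)
Depth 8: R1 (0.60), R5 (0.45), R9 (0.55), R1 (0.50)
Depth 9: R4 (0.40), R6 (0.45), R10 (0.55), R6 (0.40)
Depth 10: R1 (0.30), R6 (0.50), R9 (0.30), R10 (0.30)
Objects seen in all 4 lists: R8, R7, R2.
Overall grades of the objects seen: R2 3.05, R3 3.40, R4 2.55, R5 3.05,
R7 3.15, R8 3.30, R9 2.05, R10 2.65.]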

8
Basic algorithms: Threshold Algorithm

The Threshold Algorithm (TA) is similar to FA, with a slight modification.
TA consists of the following steps:
Do sorted access in parallel to each of the m lists.
For every new object R seen, do random access to every other list to find
the i-th field x_i of R.
Use the aggregation function t(R) = t(x_1, x_2, ..., x_m) to compute the
overall grade of the object, and store it in set Y only if it is one of the
top k objects seen so far.
After every round of sorted access, compute the threshold value T by
applying the aggregation function to the grades last seen in each list.
Stopping point: as soon as at least k objects have been seen whose grade is
at least T.
Return set Y, which holds the top k objects.
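
A minimal Python sketch of these steps follows. The assumptions are the same as for any toy version: each list is a Python list of (object, grade) pairs sorted by grade in descending order, every object appears in every list, dictionary lookups stand in for random access, and the name `threshold_top_k` and the sample grades are hypothetical.

```python
def threshold_top_k(lists, k, agg=min):
    """Sketch of TA: keep only the best k objects and stop at the threshold."""
    m = len(lists)
    lookup = [dict(lst) for lst in lists]  # stands in for random access
    top = {}  # set Y: at most k objects with their overall grades

    for depth in range(len(lists[0])):
        last_grades = []
        for lst in lists:
            obj, grade = lst[depth]  # sorted access
            last_grades.append(grade)
            if obj in top:
                continue
            # Random access to every list, then aggregate.
            overall = agg(lookup[j][obj] for j in range(m))
            if len(top) < k:
                top[obj] = overall
            else:
                worst = min(top, key=top.get)
                if overall > top[worst]:
                    del top[worst]
                    top[obj] = overall
        # Threshold T: aggregate of the grades last seen in each list.
        threshold = agg(last_grades)
        if len(top) >= k and min(top.values()) >= threshold:
            break

    return sorted(top.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical sample data: two lists, each sorted by grade descending.
redness = [("R1", 1.00), ("R3", 0.70), ("R2", 0.50), ("R4", 0.40)]
circle = [("R3", 1.00), ("R1", 0.10), ("R2", 0.00), ("R4", 0.00)]
ta_result = threshold_top_k([redness, circle], k=1)
print(ta_result)
```

Note that the buffer `top` never holds more than k objects, which is the main practical difference from FA.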

9
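Basic algorithms: Threshold Algorithm

[Figure: worked example of TA on four sorted lists over the objects R1 to
R10, returning the top 3 objects. Entries read at each depth of sorted
access, one from each list, with the threshold T where the figure shows it:
Depth 1: R8 (0.95), R3 (0.95), R5 (1.00), R10 (1.00); T = 3.90/4
Depth 2: R7 (0.95), R10 (0.80), R2 (0.90), R3 (0.95); T = 3.60/4
Depth 3: R4 (0.70), R8 (0.90), R5 (0.85), R7 (0.85); T = 3.30/4
Depth 4: R2 (0.85), R8 (0.65), R3 (0.80), R8 (0.80); T = 3.10/4
Depth 5: R4 (0.80), R5 (0.75), R7 (0.60), R7 (0.75)
Depth 6: R3 (0.70), R2 (0.55), R9 (0.70), R2 (0.65)
Depth 7: R4 (0.65), R1 (0.65), R9 (0.50), R6 (0.60)
Depth 8: R1 (0.60), R5 (0.45), R9 (0.55), R1 (0.50)
Depth 9: R4 (0.40), R6 (0.45), R10 (0.55), R6 (0.40)
Depth 10: R1 (0.30), R6 (0.50), R9 (0.30), R10 (0.30)
Top 3 objects: R3 (3.40/4), R8 (3.30/4), R7 (3.15/4).
Other overall grades shown: R5 (3.05/4), R2 (2.95/4), R10 (2.65/4).]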
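
10
Basic algorithms: Comparison between TA and FA

FA is optimal only in some cases, whereas TA is instance optimal: for every
monotone aggregation function and every database, its cost is within a
constant factor of that of any algorithm that does not make wild guesses.
TA needs only a small, bounded buffer (the current top k objects), whereas
FA requires a buffer that grows with the database size.
TA may do m-1 random accesses for every object it sees, including objects
that are not in the top k set, whereas FA does this random access only once
for every object newly seen under sorted access.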

11
Modifications of the TA Algorithm

Approximation: an algorithm that finds the top k elements within a degree of
approximation x. It stops earlier than TA.
Restricting sorted access: for when sorted access to some of the lists is
not allowed, e.g. finding the best restaurant.
Restricting random access:
NRA (No Random Access) was developed for when no random access is allowed,
e.g. a text retrieval system.
CA (Combined Algorithm) was developed for situations where random access is
allowed but very costly, e.g. random disk access. It is a combination of TA
and NRA.
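
To show why the approximation variant stops earlier, here is a small sketch of its stopping test. It assumes the degree of approximation x is a number theta >= 1 and that the algorithm halts once k objects have a grade of at least T / theta, where T is the TA threshold; the function and parameter names are hypothetical.

```python
def approx_should_stop(top_grades, threshold, k, theta):
    """Stopping test for the approximate variant (theta = 1 gives plain TA)."""
    if theta < 1:
        raise ValueError("theta must be at least 1")
    # Relaxed rule: k objects with grade >= threshold / theta are enough.
    return len(top_grades) >= k and min(top_grades) >= threshold / theta


# With threshold 1.0 and a best grade of 0.7, exact TA must keep going,
# but a 2-approximation may stop immediately.
exact_stop = approx_should_stop([0.7], threshold=1.0, k=1, theta=1.0)
approx_stop = approx_should_stop([0.7], threshold=1.0, k=1, theta=2.0)
print(exact_stop, approx_stop)
```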

12
Advanced algorithms

Suppose we already have several ranked lists of objects. The problem is to
aggregate these lists into a single ranked list.
The problem can be solved using a median-finding algorithm.
The steps of the median-finding algorithm are:
- Find the rank of each object in each of the ranked lists.
- For each object, find the median of its ranks across the lists.
- Sort the objects by their median ranks.
- Retrieve the results from this sorted list.
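
These four steps can be sketched directly in Python. The sketch assumes every ranked list is a full ranking of the same objects, best first; the function name `median_rank_aggregate` and the sample rankings are hypothetical.

```python
from statistics import median


def median_rank_aggregate(rankings):
    """Aggregate full rankings (best first) by the median rank of each object."""
    # Step 1: rank of each object in each list (1 = best).
    positions = [
        {obj: rank for rank, obj in enumerate(ranking, start=1)}
        for ranking in rankings
    ]
    # Step 2: median of the ranks of each object.
    median_rank = {
        obj: median(pos[obj] for pos in positions) for obj in rankings[0]
    }
    # Steps 3 and 4: sort by median rank (ties broken by name) and return.
    return sorted(median_rank, key=lambda obj: (median_rank[obj], obj))


rankings = [["a", "b", "c"], ["b", "a", "c"], ["b", "c", "a"]]
aggregated = median_rank_aggregate(rankings)
print(aggregated)
```

Here b has ranks 2, 1, 1 (median 1), a has ranks 1, 2, 3 (median 2), and c has ranks 3, 3, 2 (median 3).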

13
Advanced algorithms

A limitation of the median-finding algorithm is its large number of random
accesses, which the MEDRANK algorithm overcomes.
MEDRANK accesses the ranked lists one element from every list at a time,
until some element has been seen in more than half of the lists. That
element is the top result; further results are found by continuing in the
same way.
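
A minimal sketch of this idea, using only sorted access, is shown below. It assumes every ranked list is a full ranking of the same objects, best first; the function name `medrank` and the sample rankings are hypothetical.

```python
from collections import Counter


def medrank(rankings, k=1):
    """Return the first k objects seen in more than half of the lists."""
    m = len(rankings)
    counts = Counter()  # number of lists in which each object has been seen
    winners = []
    for depth in range(len(rankings[0])):
        for ranking in rankings:  # one element from every list at a time
            obj = ranking[depth]
            counts[obj] += 1
            if counts[obj] * 2 > m and obj not in winners:
                winners.append(obj)
                if len(winners) == k:
                    return winners
    return winners


rankings = [["a", "b", "c"], ["b", "a", "c"], ["b", "c", "a"]]
top_one = medrank(rankings, k=1)
top_two = medrank(rankings, k=2)
print(top_one, top_two)
```

No random access is needed: the function only walks down the lists and counts how often each object has appeared.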

14
Related work

In 1996, Chaudhuri and Gravano presented an algorithm built on Fagin's
original FA algorithm.
In 1997 and 1998, Carey and Kossmann presented techniques to optimize top-k
queries.
In 1999, Nepal and Ramakrishna presented variations on Fagin's TA algorithm
for processing queries over multimedia databases.
In 2000, Güntzer et al. made a notable contribution to Fagin's TA algorithm
by reducing the number of random accesses.
In 2002, Chang and Hwang presented an algorithm called MPro to optimize the
execution of expensive predicates.

15
How web-search engines rank the web pages (1)

Web-search engines rank the web pages based on
various factors.
Some of the most commonly found web-search
engines are

Frequency of occurrence and location of the search terms are the primary
factors.
Two of the most important web-search engines are Google and AltaVista.

16
How web-search engines rank the web pages (2)

AltaVista
- Maintains a huge phrase dictionary.
- The basic intuition behind its ranking of web pages is as follows:
- It first displays all the pages containing the phrase.
- Then it displays the pages in which the words are close to each other.
- These are followed by the pages containing all the terms, and then by the
pages containing any of the terms.
- Another important factor is the popularity of the search being performed.

17
How web-search engines rank the web pages (3)

Google
- Uses a very different technology, called PageRank.
PageRank
- Measures the importance of a web page by solving an equation.
- Interprets a link as a vote.
- Assesses a page's importance by the number of votes it receives.
- Important pages receive a higher rank and appear at the top of the search
results.
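
The "link as a vote" idea can be illustrated with a simple iterative computation. This is a sketch of the commonly published PageRank formulation, not Google's production system: the damping factor 0.85, the iteration count, the handling of pages without outgoing links, the function name `page_rank`, and the tiny link graph are all assumptions made for illustration.

```python
def page_rank(links, damping=0.85, iterations=50):
    """Iteratively compute a rank for each page from its incoming links."""
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vote evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no links spreads its vote over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: A and B both link to C, and C links back to A.
links = {"A": ["C"], "B": ["C"], "C": ["A"]}
ranks = page_rank(links)
print(sorted(ranks, key=ranks.get, reverse=True))
```

C receives two votes and ends up with the highest rank; B receives none and ends up lowest.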

18
Conclusion

The literature we studied shows that much work has been done on the problem
of retrieving the top-k results from a database.
We came across many algorithms that are quite tricky to understand.
Research in this field is still very active.
The current focus is on devising more sophisticated algorithms for
aggregating ranked lists.

19
References

[1] Ronald Fagin. Combining Fuzzy Information from Multiple Systems.
Received July 4, 1996; revised June 22, 1998.
[2] Ronald Fagin. Combining Fuzzy Information: an Overview. ACM SIGMOD
Record 31, 2, June 2002, pp. 109-118.
[3] Ronald Fagin, Amnon Lotem, and Moni Naor. Optimal Aggregation Algorithms
for Middleware. Journal of Computer and System Sciences 66 (2003),
pp. 614-656. Extended abstract appeared in Proc. 2001 ACM Symposium on
Principles of Database Systems (PODS '01), pp. 102-113.
[4] Ronald Fagin, Ravi Kumar, and D. Sivakumar. Efficient Similarity Search
and Classification via Rank Aggregation. Proc. 2003 ACM SIGMOD Conference
(SIGMOD '03), pp. 301-312.
[5] Ronald Fagin, Ravi Kumar, Mohammad Mahdian, D. Sivakumar, and Erik Vee.
Comparing and Aggregating Rankings with Ties. Proc. 2004 ACM Symposium on
Principles of Database Systems (PODS '04), pp. 47-58.
[6] Ronald Fagin, Ravi Kumar, and D. Sivakumar. Comparing Top k Lists. SIAM
J. Discrete Mathematics 17, 1 (2003), pp. 134-160. Extended abstract in 2003
ACM-SIAM Symposium on Discrete Algorithms (SODA '03), pp. 28-36.
[7] A. Marian, N. Bruno, and L. Gravano. Evaluating Top-k Queries over
Web-Accessible Databases. Accepted for publication in ACM Transactions on
Database Systems, 2003.
[8] Martin P. Courtois and Michael W. Berry. Results Ranking in Web Search
Engines. Online, May 1999.