Title: Databases and Information Retrieval
Lecture 1: Basics of Databases and Information Retrieval
Instructor: Mr. Gautam Das, University of Texas at Arlington
Email: gdas_at_cse.uta.edu
Database
- Consists of a schema
- Relational model
- Data stored in the form of tables
- Follows the typical query model, with joins
- Output is in the form of tuples, produced by joins over one or more tables
IR
- Data: a collection of documents, each an unstructured piece of information
- Follows the rank-and-relevance query model
- Output is the document
Types of Queries
- Conjunctive Queries
  - Car, Accident
  - Matches documents that contain both words, Car and Accident.
- General Boolean Queries
  - Car AND Accident AND NOT Arlington
  - Matches documents that contain the words Car and Accident but not the word Arlington.
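The two query types can be sketched with an inverted index and set operations. This is a minimal illustration, not part of the original slides: the toy documents, the `build_index`, `conjunctive`, and `and_not` helpers, and the whitespace tokenization are all assumptions made for the example. A conjunctive query is treated as an AND over its words.

```python
from collections import defaultdict

# Hypothetical toy collection; ids and text are invented for illustration.
DOCS = {
    1: "car accident in arlington",
    2: "car accident on the highway",
    3: "car for sale",
    4: "accident report",
}

def build_index(docs):
    """Map each word to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def conjunctive(index, *words):
    """Documents that contain every one of the given words (AND)."""
    postings = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*postings) if postings else set()

def and_not(index, required, excluded):
    """Documents with all `required` words and none of the `excluded` words."""
    result = conjunctive(index, *required)
    for w in excluded:
        result = result - index.get(w.lower(), set())
    return result

INDEX = build_index(DOCS)
both = conjunctive(INDEX, "Car", "Accident")
without_arlington = and_not(INDEX, ["Car", "Accident"], ["Arlington"])
```

Here `both` holds the documents that mention Car and Accident together, and `without_arlington` drops from that set any document that also mentions Arlington. Note that the result is an unordered set: a Boolean model says which documents match, not which match best.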
Retrieval Models of IR
- Boolean Retrieval Model
- Ranked / Relevance Retrieval Model
  - This is the model that databases lack.
Parameters Used for Ranking in a Typical Information Retrieval System
- Parameter 1: Occurrence and Frequency
  - The number of times the specified word occurs in the document determines the rank.
  - The position at which it occurs also matters, e.g. title or subtitle.
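One way to combine frequency and position is a weighted count per field. The sketch below is only an illustration: the `title` and `body` field names, the 3-to-1 weighting, and the `occurrence_score` helper are assumptions, not values given in the lecture.

```python
# Assumed weights: a hit in the title counts three times as much as a hit in the body.
FIELD_WEIGHTS = {"title": 3.0, "body": 1.0}

def occurrence_score(doc, word):
    """Weighted number of occurrences of `word` across the fields of `doc`."""
    word = word.lower()
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        tokens = doc.get(field, "").lower().split()
        score += weight * tokens.count(word)
    return score

# Hypothetical documents, each split into a title and a body.
doc_a = {"title": "Database basics", "body": "A database stores tables"}
doc_b = {"title": "Storage notes", "body": "A database stores tables in a database file"}

score_a = occurrence_score(doc_a, "database")  # one title hit, one body hit
score_b = occurrence_score(doc_b, "database")  # two body hits, no title hit
```

With these assumed weights, `doc_a` outranks `doc_b` even though both contain the word twice, because one of its occurrences is in the title.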
Parameters Used for Ranking in a Typical Information Retrieval System
- Parameter 2: Proximity
  - If two or more words are specified in the search string, then documents containing those words near each other should be ranked higher.
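Proximity can be measured as the smallest gap, in words, between occurrences of two query terms. The following is a rough sketch under simple assumptions (whitespace tokenization, exactly two query words); the `positions` and `min_distance` helpers and the sample sentences are hypothetical.

```python
def positions(tokens, word):
    """Indexes at which `word` appears in the token list."""
    return [i for i, t in enumerate(tokens) if t == word]

def min_distance(text, word_a, word_b):
    """Smallest gap in words between word_a and word_b, or None if either is absent."""
    tokens = text.lower().split()
    pos_a = positions(tokens, word_a.lower())
    pos_b = positions(tokens, word_b.lower())
    if not pos_a or not pos_b:
        return None
    return min(abs(i - j) for i in pos_a for j in pos_b)

near = "the car accident blocked the road"
far = "the car was sold years before the accident"

near_gap = min_distance(near, "car", "accident")
far_gap = min_distance(far, "car", "accident")
```

A smaller gap suggests the words are used together, so a ranking function would place `near` above `far` for the query Car Accident.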
Parameters Used for Ranking in a Typical Information Retrieval System
- Parameter 3: Stemming
  - Searches on the various grammatical forms of a word.
  - E.g. Run -> Ran, Run over, Running
  - An exact match of the word should be ranked higher.
  - E.g. if the word info is searched, a document containing the word infotech should be ranked after a document containing the exact match info.
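The exact-match rule can be sketched as a two-level score. Note the simplification: the code below uses a plain prefix test as a crude stand-in for stemming, which is an assumption made for brevity and not a real stemming algorithm. The `match_score` helper and the sample documents are hypothetical.

```python
def match_score(query, text):
    """2 for an exact word match, 1 if some word only starts with the query, else 0."""
    query = query.lower()
    tokens = text.lower().split()
    if query in tokens:
        return 2
    if any(t.startswith(query) for t in tokens):
        return 1
    return 0

docs = ["infotech company profile", "contact info page", "weather report"]

# Highest score first; documents with no match at all sort last.
ranked = sorted(docs, key=lambda d: match_score("info", d), reverse=True)
```

For the query info, the document with the exact word comes first, the infotech document second, and the unrelated document last.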
Parameters Used for Ranking in a Typical Information Retrieval System
- Parameter 4: Frequency across Documents
  - Words such as a, an, and the occur in almost every document and should be suppressed, since they are most likely irrelevant to the search criteria.
  - If we are searching for Microsoft Corporation, the specific word Microsoft is more important than the general word Corporation.
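A common way to express this idea is inverse document frequency, log(N / df), where N is the number of documents and df is the number that contain the word. The slides do not name a formula, so treat this as one possible sketch; the `idf` helper, the toy corpus, and the choice to return 0.0 for unseen words are assumptions.

```python
import math

def idf(word, docs):
    """log(N / df): N documents in total, df of them contain `word`."""
    word = word.lower()
    df = sum(1 for text in docs if word in text.lower().split())
    # Assumption: a word that appears nowhere gets weight 0.0 instead of an error.
    return math.log(len(docs) / df) if df else 0.0

# Hypothetical corpus: "the" is everywhere, "corporation" is common, "microsoft" is rare.
docs = [
    "the microsoft corporation released a product",
    "the acme corporation hired staff",
    "the globex corporation opened an office",
    "the weather is mild",
]

weights = {w: idf(w, docs) for w in ("the", "corporation", "microsoft")}
```

In this corpus the weight of the is exactly zero because it appears in every document, Corporation gets a small weight, and Microsoft gets the largest, matching the intuition in the bullets above.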
Parameters Used for Ranking in a Typical Information Retrieval System
- Parameter 5: Page Access Frequency
  - If a page is accessed more often, i.e. if it is popular, it should be ranked higher.
  - This kind of ranking requires maintaining a log of how often each page is accessed.
  - Useful for systems that store news, stories, or other readable articles.
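The access log mentioned above can be reduced to per-page counts and used as a sort key. This is a minimal sketch; the log format (one path per page view), the page paths, and the `rank_by_popularity` helper are invented for illustration.

```python
from collections import Counter

# Hypothetical access log: one entry per page view.
access_log = ["/news/a", "/news/b", "/news/a", "/stories/c", "/news/a", "/news/b"]

# Number of recorded accesses per page.
access_counts = Counter(access_log)

def rank_by_popularity(pages, counts):
    """Order pages by how often they were accessed, most popular first."""
    return sorted(pages, key=lambda p: counts[p], reverse=True)

ranked_pages = rank_by_popularity(["/stories/c", "/news/b", "/news/a"], access_counts)
```

In practice this count would likely be one signal among several rather than the whole ranking, since it says nothing about how well a page matches the query.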
Parameters Used for Ranking in a Typical Information Retrieval System
- Parameter 6: Number of In-Links to the Page
  - The number of other pages on the web that link to the page being ranked.
  - Again a parameter for gauging the popularity of a page.
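Counting in-links amounts to inverting a link graph. The sketch below assumes a small hypothetical graph stored as a dictionary of outgoing links, counts each linking page at most once, and ignores self-links; those choices, like the `in_link_counts` helper, are assumptions for the example.

```python
from collections import Counter

# Hypothetical link graph: each page maps to the pages it links to.
links = {
    "home": ["news", "about"],
    "news": ["about", "story"],
    "story": ["about", "news"],
    "about": [],
}

def in_link_counts(graph):
    """Number of distinct other pages that link to each page."""
    counts = Counter({page: 0 for page in graph})
    for source, targets in graph.items():
        for target in set(targets):  # count each linking page once
            if target != source:     # a page linking to itself does not count
                counts[target] += 1
    return counts

in_links = in_link_counts(links)
most_linked = max(in_links, key=in_links.get)
```

Here the about page has the most in-links, so under this parameter alone it would be treated as the most popular page.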