Title: IMPLEMENTATION OF INFORMATION RETRIEVAL SYSTEMS VIA RDBMS
1IMPLEMENTATION OF INFORMATION RETRIEVAL SYSTEMS
VIA RDBMS
2Relational Database Definitions
- Relational database a set of relations
- Relation made up of 2 parts
- Instance a table, with rows and columns. Rows
cardinality, fields degree / arity. - Schema specifies name of relation, plus name
and type of each column. - E.G. Students(sid string, name string, login
string, age integer, gpa
real). - Can think of a relation as a set of rows or
tuples (i.e., all rows are distinct).
3Example Instance of Students Relation
Cardinality 3, degree 5, all rows distinct
4Relational Query Languages
- A major strength of the relational model
supports simple, powerful querying of data. - Queries can be written intuitively, and the DBMS
is responsible for efficient evaluation.
5The SQL Query Language
- Developed by IBM (system R) in the 1970s
- Need for a standard since it is used by many
vendors - Standards
- SQL-86
- SQL-89 (minor revision)
- SQL-92 (major revision, current standard)
- SQL-99 (major extensions)
6The SQL Query Language
- To find all 18 year old students, we can write
SELECT FROM Students S WHERE S.age18
- To find just names and logins, replace the first
line
SELECT S.name, S.login
7 Querying Multiple Relations
SELECT S.name, E.cid FROM Students S, Enrolled
E WHERE S.sidE.sid AND E.gradeA
8Creating Relations in SQL
- Creates the Students relation. Observe that the
type (domain) of each field is
specified, and enforced by the DBMS whenever
tuples are added or modified. - As another example, the Enrolled table holds
information about courses that students take.
CREATE TABLE Students (sid CHAR(20), name
CHAR(20), login CHAR(10), age INTEGER,
gpa REAL)
CREATE TABLE Enrolled (sid CHAR(20), cid
CHAR(20), grade CHAR(2))
9Combining Separate Systems
- Use an IR and RDBMS systems which are
independent. - Divide the query into two
- Structured part for the RDBMS
- Unstructured (text) part for the IR
- Combine the results from IR and RDBMS
- Good for letting each vendor develop its own
system - Bad for data integrity, recovery, portability,
and performance
10User Defined Operators
- Allow users to modify SQL by adding their own
functions - Some vendors used this approach (such as IBM DB2
text extender) - Lynch and Stonebreaker defined user defined
operators to implement information retrieval in
1988
//Retrieves documents that contain term1, term2,
term3 SELECT Doc_Id FROM Doc WHERE
SEARCH-TERM(Text, Term1, Term 2, Term3)
//Retrieves documents that contain term1, term2,
term3 // within a window of 5 terms SELECT
Doc_Id FROM Doc WHERE PROXIMITY(Text,5, Term1,
Term 2, Term3)
11Non-First Normal Form Approaches
- Capture the many-to-many relationships into sets
via nested relations - Hard to implement ad-hoc queries
- No standard yet
12Using RDBMS for IR
- Benefits
- Recovery
- Performance
- Data migration
- Concurrency Control
- Access control mechanism
- Logical and physical data independence
13Using RDBMS for IR
- Example A bibliography that includes both
structured and unstructured information - DIRECTORY (name, institution) affiliation of
the author - AUTHOR(name,DocId) authorship information
- INDEX (name, DocId) terms that are used to index
a document
14Using RDBMS for IR
- Preprocessing
- SGML can be used as a starting point which is a
standard for defining parts of documents
ltDOCgt ltDOCNOgt WSJ834234234 lt/DOCNOgt ltHLgt How to
make students suffer in IR Course lt/HLgt ltDDgt
03/23/87lt/DDgt ltDATELINEgt Sabanci, Turkey
lt/DATELINEgt ltTEXTgt Crawler HW, Inverted Index,
Querying lt/TEXTgt lt/DOCgt
15Using RDBMS for IR
- Preprocessing
- SGML can be used as a starting point which is a
standard for defining parts of documents - Use a parser together with a hash function to
identify terms - Use STOP_TERM table for referencing stop words
- Produce three output tables
- INDEX (DocId, Term, TermFrequency) Models the
inverted index - DOC (DocId, DocName, PubDate, DateLine)
Document metadata - TERM (Term, Idf) stored the weights of each
term
//Construct TERM table, N is the total number of
documents INSERT INTO TERM SELECT
Term,log(N/Count()) FROM INDEX GROUP BY Term
16Using RDBMS for IR
An offset can be added together with the term to
be able to answer proximity queries. For example
Vice President should occur together in the
same document for relevant documents
etc. INDEX_PROX (DocId, Term, OffSet)
//Construct TERM table, N is the total number of
documents INSERT INTO INDEX SELECT DocId, Term,
COUNT() FROM INDEX_PROX GROUP BY DocId, Term
17Using RDBMS for IR
- Query can be modeled as a relation as well when
it is a long document - QUERY(Term,TermFreq)
- Ex Find all news documents written on
03/03/2005 about Sabanci University - Data will be extracted from the structured fields
- Terms will be extracted using the inverted index
-
SELECT d.DocId FROM DOC d, INDEX i WHERE i.Term
IN (Sabanci, University) AND d.PubDate
03/03/2005 AND d.DocId i.DocId
18Using RDBMS for IR
- Boolean Queries Consists of terms with boolean
operators (AND, OR, and NOT) - For a single inputTerm retrieve the document
texts that contain that term
SELECT d.Text FROM DOC d, WHERE d.DocId IN
(SELECT DISTINCT (i.DocId) FROM
INDEX i WHERE i.Term
inputTerm) Note that we can store the text part
of a document using BLOB or CLOG ( Binary or
Character Large Object)
19Using RDBMS for IR
- Boolean Queries that contain OR
SELECT DISTINCT (i.DocId) FROM INDEX i WHERE
i.Term inputTerm1 OR i.Term
inputTerm2 OR .. i.Term
inputTermn OR
20Using RDBMS for IR
- Boolean Queries that contain AND
SELECT DISTINCT (i.DocId) FROM INDEX i WHERE
i.Term inputTerm1 AND i.Term
inputTerm2 AND .. i.Term
inputTermn AND ??
21Using RDBMS for IR
- Boolean Queries that contain AND (Previous Answer
Was Wrong)
SELECT DISTINCT (i.DocId) FROM INDEX i1, INDEX
i2, INDEX i3, . INDEX in WHERE i1.Term
inputTerm1 AND i2.Term inputTerm2
AND .. in.Term
inputTermn AND i1.DocID i2.DocId
AND i2.DocID i3.DocId AND
in-1 in.DocID OR YOU CAN USE
INTERSECTION
22Using RDBMS for IR
- Boolean Queries that contain AND
- Commercial DBMSs are not able to process more
than a fixed number of joins. - Solution
SELECT i.DocId FROM INDEX i, Query q WHERE
i.Term q.term GROUP BY i.DocId HAVING
COUNT(i.Term) (SELECT COUNT() FROM
QUERY) Works only when the INDEX contains only
one occurrence of a given term Together with its
frequency. No Proximity is recorded.
23Using RDBMS for IR
- Boolean Queries that contain AND
- Commercial DBMSs are not able to process more
than a fixed number of joins. - Solution for terms appearing more than once in
the INDEX
SELECT i.DocId FROM INDEX i, Query q WHERE
i.Term q.term GROUP BY i.DocId HAVING
COUNT(DISTINCT(i.Term)) (SELECT COUNT() FROM
QUERY) This is slower since DISTINC requires a
sort for duplicate elimination.
24Using RDBMS for IR
- Boolean Queries that contain AND
- Commercial DBMSs are not able to process more
than a fixed number of joins. - Implementation of TAND (Threshold AND) is also
simple
SELECT i.DocId FROM INDEX i, Query q WHERE
i.Term q.term GROUP BY i.DocId HAVING
COUNT(DISTINCT(i.Term)) gt k
25Using RDBMS for IR
- Proximity Queries for terms within a specific
window width
SELECT a.DocId FROM INDEX_PROX a, INDEX_PROX b
WHERE a.Term IN (SELECT q.Term FROM QUERY q)
AND b.Term IN (SELECT q.Term FROM
QUERY q) AND a.DocId b.DocId AND
(a.offset b.offset) BETWEEN 0 AND
(width-1) GROUP BY a.DocId, b.DocId, a.Term,
a.offset HAVING COUNT(DISTINCT(b.Term)) SELECT
(COUNT() FROM QUERY)
26Using RDBMS for IR
SELECT i.DocId, SUM(q.tft.idft.tft.idf) FROM
QUERY q, INDEX i, TERM t WHERE q.Term t.term
AND i.Term t.Term GROUP BY i.DocId ORDER BY 2
DESC