INFO624 Week 2: Models of Information Retrieval

1
INFO624 - Week 2: Models of Information Retrieval

Dr. Xia Lin
Associate Professor
College of Information Science and Technology
Drexel University

2
Review of Last Week

Challenges of Information Retrieval
Translate users' information needs into queries.
Match queries to stored information.
Evaluate whether the query results match the users' information needs.
Differences between
Data, information, and knowledge
Data retrieval and information retrieval

3
Assignment 1

Some of my favorite Search Software Packages
IBM Content Management (high-cost)
AOL PLS Search Engine (free)
GreenStone Digital Library Software (open-source)
SWISH (open source)
mnoGoSearch (free)
Apache Lucene (open source components)

4
Documents

Documents are logical units of text
Units of records (text + other components)
Units that can be stored, retrieved, and displayed as a unique entity
Units of semantic entity
units of text grouped together for a purpose
Units of unformatted text
Text as written by authors of documents.

5
Document Models

Documents need to be processed and represented in concise and identifiable formats/structures.
Documents are full of text.
Not every word of the text is meaningful for searching/retrieval.
Documents themselves do not have identifiable attributes such as authors and titles.

6
Figure 1.2: Logical view of a document, from full text to a set of index terms.

7
Document Representation

Documents should be represented to help users
identify and receive information from the system.
to identify authors and titles
to identify subjects
to provide summaries/abstracts
to classify subject categories

8
Document Surrogates

Each document should have one or more short and
descriptive labels/attributes
Level 1
Title
Author
Keywords
Level 2
Level 1 + Abstract
Level 3
Level 2 + full text

9
A Formal IR Model

An information retrieval model is a quadruple (D, Q, F, R(qi, dj)) where
D is a set composed of logical views (or representations) of the documents in the collection.
Q is a set composed of logical views (or representations) of the users' information needs. Such representations are called queries.
F is a framework for modeling document representations, queries, and their relationships.
R(qi, dj) is a ranking function that associates a real number with a query qi and a document representation dj. Such a ranking defines an ordering among the documents with regard to the query qi.

10
Computerized Indexing

Title indexing
Sort all the titles alphabetically.
Ignore a leading "a" or "the".
Convert all letters to uppercase.
Matching always starts from the beginning of the title (not from individual words).
Most early IR systems (such as library catalogs) used title indexing.
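
The title-indexing rules above can be sketched in a few lines of Python. This is only an illustration: the function names `title_key` and `starts_with` and the sample titles are assumptions, not part of the lecture.

```python
import bisect

def title_key(title):
    """Build a sort key: uppercase the title and drop a leading 'A' or 'THE'."""
    words = title.upper().split()
    if words and words[0] in ("A", "THE"):
        words = words[1:]
    return " ".join(words)

# Hypothetical titles, used only for illustration.
titles = ["The Art of Indexing", "A Guide to Retrieval", "Boolean Logic"]

# The index is a list of (key, original title) pairs sorted by key.
index = sorted((title_key(t), t) for t in titles)

def starts_with(prefix):
    """Return titles whose key begins with the prefix (matching from the start only)."""
    p = prefix.upper()
    keys = [k for k, _ in index]
    i = bisect.bisect_left(keys, p)  # binary search in the ordered key list
    hits = []
    while i < len(index) and index[i][0].startswith(p):
        hits.append(index[i][1])
        i += 1
    return hits

print(starts_with("art"))       # matches: the key begins with ART
print(starts_with("indexing"))  # no match: INDEXING is not at the start of any key
```

Because matching is anchored at the start of the key, a word in the middle of a title cannot be found, which is the limitation that word indexing addresses.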

11
Word indexing

Parse every individual word from the documents.
First decision: what is a word?
Are digits words?
How about letter-and-digit combinations such as B6 and B12?
Is F-16 one word or two words?
Hyphens
Online, on-line, on line?
F-16
Singular or plural?
List all the words alphabetically with pointers back to the documents: inverted indexing.

12
Inverted Indexing

An inverted index consists of an ordered list of indexing terms; each indexing term is associated with one or more document identification numbers.
Retrieval is done by first searching the ordered list to find the indexing term, then using the document identification numbers to locate the documents.
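
A minimal sketch of building and querying an inverted index in Python follows. The tokenizer keeps internal hyphens (so "F-16" is one term), which is just one possible answer to the "what is a word?" question; the sample documents and the names `tokenize`, `build_inverted_index`, and `lookup` are illustrative assumptions.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text; a term is a run of letters/digits,
    optionally joined by internal hyphens (so 'F-16' stays one term)."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

def build_inverted_index(docs):
    """docs: {doc_id: text} -> {term: sorted list of doc_ids}, terms in order."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

# Hypothetical documents, used only for illustration.
docs = {
    1: "The F-16 is on-line",
    2: "Online retrieval of information",
    3: "Information about the F-16",
}
index = build_inverted_index(docs)

def lookup(term):
    """Find the term in the index and return its document identification numbers."""
    return index.get(term.lower(), [])

print(lookup("F-16"))         # documents containing the term f-16
print(lookup("information"))  # documents containing the term information
```

Note that with this tokenizer "online" and "on-line" end up as different index terms, so a search for one does not find the other.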

13
Example: Create an inverted index for the following

14
Boolean Logic

Logical operators defined on sets
True and false
A set is a collection of items with certain common characteristics.
Any item either belongs to the set (true) or does not belong to the set (false).
AND
Combines two sets, A and B, to create a smaller (or at least not larger) set C.
Any item in C must be in BOTH set A and set B.
OR
Union of two sets, A and B, to create a larger (or at least not smaller) set C.
Any item in C must be in set A, in set B, or in both.
NOT
Excludes items from a set.

15
Example

Given
A = {1, 3, 7, 12, 14, 25, 36}
B = {1, 2, 3, 4, 5, 7, 8, 12, 13, 14, 15, 25, 26}
C = {2, 4, 6, 8, 10, 11, 12, 13, 14}
Derive
A AND B
A OR B
A AND B AND C
(A AND B) NOT C

(A AND B) OR C
(A OR B) AND C
A AND (B OR C)
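
One way to check answers to this exercise is with Python's built-in set type, where `&` is AND, `|` is OR, and `-` plays the role of NOT (set difference). The sets below are copied from the slide as transcribed; the `results` dictionary is just an illustrative way to collect the answers.

```python
A = {1, 3, 7, 12, 14, 25, 36}
B = {1, 2, 3, 4, 5, 7, 8, 12, 13, 14, 15, 25, 26}
C = {2, 4, 6, 8, 10, 11, 12, 13, 14}

results = {
    "A AND B": A & B,
    "A OR B": A | B,
    "A AND B AND C": A & B & C,
    "(A AND B) NOT C": (A & B) - C,
    "(A AND B) OR C": (A & B) | C,
    "(A OR B) AND C": (A | B) & C,
    "A AND (B OR C)": A & (B | C),
}

for expression, items in results.items():
    print(f"{expression}: {sorted(items)}")
```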

16
Boolean Logic

Venn Diagram
graphical representation of Boolean logic
A and (B or C)
A and B or (C and D)

17
Boolean Query

Terms connected by Boolean operators
The system retrieves a set of documents based on
the Boolean logic of the query.
Examples
(network or networks or structured or system or
systems) and (information or retrieval)

18
Advantages of Boolean Search

Simple and specific
Effective
AND reduces the number of hits very quickly
OR expands search scope
Strongly logic-based
proven mathematical foundations

19
Problems of Boolean Search

Boolean search is an exact-match search
A document is either retrieved or not retrieved.
Requesting "computer" would not find "computing" unless more programming is done.
No weighting can be done on terms
In the query A AND B, you can't specify that A is more important than B.

No ranking
Retrieved sets cannot be ordered based on Boolean logic.
Every retrieved document is treated equally.
Possible confusion over operator order
A AND B OR C
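
The order confusion can be made concrete with a small Python sketch. The three sets are hypothetical; the point is only that the two possible groupings of A AND B OR C return different results, so which one a given system applies matters.

```python
# Hypothetical posting sets for three query terms.
A = {1, 2, 3}
B = {2, 3, 4}
C = {5}

and_first = (A & B) | C   # read as (A AND B) OR C
or_first = A & (B | C)    # read as A AND (B OR C)

print(sorted(and_first))
print(sorted(or_first))

# Python itself evaluates & before |, so the unparenthesized form
# matches the AND-first reading; a retrieval system may differ.
unparenthesized = A & B | C
```

Writing the parentheses explicitly in a query avoids depending on a particular system's evaluation order.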

21
Vectors

A numerical representation for a point in a
multi-dimensional space.
(x1, x2, ..., xn)
Dimensions of the space need to be defined
A measure of the space needs to be defined.

22
Vector Representation of Document Space

Each indexing term is a dimension
Each document is a vector
Di = (ti1, ti2, ti3, ti4, ..., tin)
Dj = (tj1, tj2, tj3, tj4, ..., tjn)
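Document similarity is defined as the cosine of the angle between the two vectors (the measure that reproduces the values in the example that follows):
S(D_i, D_j) = \frac{\sum_{k=1}^{n} t_{ik} \, t_{jk}}{\sqrt{\sum_{k=1}^{n} t_{ik}^2} \; \sqrt{\sum_{k=1}^{n} t_{jk}^2}}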

23
Example

A document space is defined by three terms
hardware, software, user
A set of documents is defined as
A1 = (1, 0, 0), A2 = (0, 1, 0), A3 = (0, 0, 1)
A4 = (1, 1, 0), A5 = (1, 0, 1), A6 = (0, 1, 1)
A7 = (1, 1, 1), A8 = (1, 0, 1), A9 = (0, 1, 1)
If the query is "hardware and software", which documents should be retrieved?

In Boolean query matching
Documents A4 and A7 will be retrieved if the two query terms are ANDed together.
Documents A1, A2, A4, A5, A6, A7, A8, and A9 will be retrieved if the two query terms are ORed together.
In vector query matching
q = (1, 1, 0)
S(q, A1) = 0.71, S(q, A2) = 0.71, S(q, A3) = 0
S(q, A4) = 1, S(q, A5) = 0.5, S(q, A6) = 0.5
S(q, A7) = 0.82, S(q, A8) = 0.5, S(q, A9) = 0.5
Retrieved document set (in ranked order)
A4, A7, A1, A2, A5, A6, A8, A9
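
The similarity values in this example can be reproduced with a short Python sketch, assuming S is the cosine of the angle between the query vector and the document vector. The helper name `cosine` and the choice to drop zero-score documents from the ranked list are illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Dimensions: (hardware, software, user)
docs = {
    "A1": (1, 0, 0), "A2": (0, 1, 0), "A3": (0, 0, 1),
    "A4": (1, 1, 0), "A5": (1, 0, 1), "A6": (0, 1, 1),
    "A7": (1, 1, 1), "A8": (1, 0, 1), "A9": (0, 1, 1),
}
q = (1, 1, 0)

# Scores rounded to two decimals, as on the slide.
scores = {name: round(cosine(q, vec), 2) for name, vec in docs.items()}

# Keep documents with a non-zero score; sort from highest to lowest
# (Python's sort is stable, so ties keep their original order).
ranked = sorted((n for n in scores if scores[n] > 0), key=lambda n: -scores[n])

print(scores)
print(ranked)
```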

25
Weights in the Vector Space

A main advantage of the vector representation is that the items in vectors don't have to be just 0 or 1 (true or false).
A1 = (0.7, 0.5, 0.3)
A2 = (0.5, 0.2, 0.7)
A3 = (0.3, 0.6, 0.9)
A4 = (0.7, 0.9, 1.0)
Queries may also be weighted
Q = (0.7, 0.3, 0)
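
As a sketch of how weighted vectors change the outcome, the documents above can be ranked against the weighted query. The slide does not say which similarity measure to use here; cosine similarity is assumed, and the helper name `cosine` is illustrative.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

docs = {
    "A1": (0.7, 0.5, 0.3),
    "A2": (0.5, 0.2, 0.7),
    "A3": (0.3, 0.6, 0.9),
    "A4": (0.7, 0.9, 1.0),
}
Q = (0.7, 0.3, 0)

scores = {name: cosine(Q, vec) for name, vec in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(name, round(scores[name], 3))
```

Under this assumption A1 ranks first: its weights are concentrated on the first term, which is also the term the query weights most heavily.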

26
TF and IDF

TF: term frequency
Number of times a term occurs in a document
DF: document frequency
Number of documents that contain the term
IDF: inverse document frequency
IDF = log(N/ni)
N: the total number of documents
ni: the number of documents that contain term i

27
Salton's Vector Space

A document is represented as a vector
(W1, W2, ..., Wn)
Binary
Wi = 1 if the corresponding term is in the document
Wi = 0 if the term is not in the document
TF (Term Frequency)
Wi = tfi, where tfi is the number of times the term occurred in the document
TF*IDF (Term Frequency x Inverse Document Frequency)
Wi = tfi * idfi = tfi * (1 + log(N/dfi)), where dfi is the number of documents that contain term i, and N is the total number of documents in the collection.
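
A minimal Python sketch of the TF*IDF weighting follows, reading the slide's weight as tf_i * (1 + log(N / df_i)). The natural logarithm is an assumption (the slide does not give the base), and the sample documents and the function name `tfidf_vectors` are illustrative.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists -> (vocabulary, list of TF*IDF weight vectors)."""
    n_docs = len(docs)
    vocab = sorted({term for doc in docs for term in doc})
    # df[t] = number of documents that contain term t
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # tf[t] = occurrences of t in this document (0 if absent)
        vectors.append([tf[t] * (1 + math.log(n_docs / df[t])) for t in vocab])
    return vocab, vectors

# Hypothetical tokenized documents, used only for illustration.
docs = [
    ["hardware", "software", "software"],
    ["software", "user"],
    ["software"],
]
vocab, vectors = tfidf_vectors(docs)

print(vocab)
for vec in vectors:
    print([round(w, 3) for w in vec])
```

In this toy collection "software" appears in every document, so its IDF factor is 1 + log(1) = 1 and its weight reduces to the raw term frequency, while the rarer terms are boosted.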

In vector space, documents and queries are
treated the same.
It is easier to do similarity search
find documents like this one
It is easier to cluster documents
group documents into categories and subcategories
It's easier to display search results graphically
Giving meaning to place or location in the
multi-dimensional space

29
Web Indexing

Most web indexing is vector-based indexing, with variations
Robot indexing software keeps traversing the web to collect more pages and terms
Servers build a huge database of inverted indexes and vector indexes
Search engines conduct different types of vector query matching
Only a few search engines implement true Boolean query matching

The real differences among different search
engines are
their indexing weight schemes
their query process methods
their ranking algorithms
None of these is published by any of the search engine firms.

31
Alternative IR Models

Probabilistic Model
Given a document d, how likely is the user to consider it relevant?
How likely is the user to consider it not relevant?
If these two are known, the similarity of document d and query q can be defined as
S(d, q) = P(d is relevant to q) / P(d is not relevant to q)

32
Examples

If a document is 80% likely to be relevant to query q, what is its (probabilistic) similarity?
If a document is only 30% likely to be relevant, what is the similarity?

If there are 100 documents and 10 are relevant to a query,
what is the probability of relevance for a randomly selected document?
What is the similarity of this document to the query?
Any retrieval system must do much better than that.
In general, retrieval systems should retrieve those documents with S > 1.
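
These exercises can be worked through with a small Python sketch, taking the similarity to be the ratio of the probability of relevance to the probability of non-relevance. The function name `odds_similarity` is an illustrative assumption.

```python
def odds_similarity(p_relevant):
    """Similarity as P(relevant) / P(not relevant) for a probability in [0, 1)."""
    if not 0 <= p_relevant < 1:
        raise ValueError("p_relevant must be in [0, 1)")
    return p_relevant / (1 - p_relevant)

s_80 = odds_similarity(0.80)          # document 80% likely to be relevant
s_30 = odds_similarity(0.30)          # document 30% likely to be relevant
s_random = odds_similarity(10 / 100)  # random pick: 10 relevant out of 100

for label, s in [("80%", s_80), ("30%", s_30), ("random", s_random)]:
    verdict = "retrieve" if s > 1 else "skip"
    print(f"{label}: S = {s:.2f} -> {verdict}")
```

S exceeds 1 exactly when a document is more likely relevant than not, which is why S > 1 works as a retrieval threshold; the randomly selected document falls far below it.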

Advantages of the Probabilistic model
Documents can be ranked by their relevance probability.
Relevance probability can be improved through the interaction process.
Good mathematical model
Disadvantages
Involves many assumptions
Not very practical

35
Fuzzy Set Model

Fuzzy Set Theory
Extension of Boolean set theory
Instead of a binary membership definition, fuzzy set membership is continuously defined between 0 and 1.
Example
Male students in our class
Tall students in our class
One is a Boolean set and one is a fuzzy set.
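
The contrast between the two sets can be sketched in Python. The membership function for "tall" below is purely hypothetical (a linear ramp between 160 cm and 190 cm); the slide does not define one, and the student records are invented for illustration.

```python
def tall_membership(height_cm, low=160.0, high=190.0):
    """Hypothetical fuzzy membership in 'tall': 0 at or below `low`,
    1 at or above `high`, and a linear ramp in between."""
    if height_cm <= low:
        return 0.0
    if height_cm >= high:
        return 1.0
    return (height_cm - low) / (high - low)

def male_membership(student):
    """Boolean (crisp) membership: always exactly 0 or 1."""
    return 1 if student["sex"] == "M" else 0

# Invented students, used only for illustration.
students = [
    {"name": "S1", "sex": "M", "height_cm": 175},
    {"name": "S2", "sex": "F", "height_cm": 190},
    {"name": "S3", "sex": "M", "height_cm": 158},
]

for s in students:
    print(s["name"], male_membership(s), tall_membership(s["height_cm"]))
```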

The set of retrieved documents should be
considered as a fuzzy set.
Documents are not just relevant or not-relevant.
Documents can be somewhat relevant.
Documents can be 80% likely to be relevant.
Good mathematical model, but not widely implemented and tested.

37
Latent Semantic Indexing Model

Map documents from a high-dimensional space to a
lower dimensional space, while maintaining
document relationships.
For clustering
For visualization
It's a popular advanced retrieval technique.
It's computationally expensive.

38
Neural Network Model

Organize the document collection as a semantic
network through learning
Use known queries/relevant documents to train the network, and later allow the network to predict relevance for new queries (supervised learning).
Use document-document relationships to self-organize the network and move relevant documents close to each other (unsupervised learning).

39
The Fusion Model

Retrieve documents based on text indexing (Boolean model, Vector Space Model, etc.)
Retrieve documents based on link models (citations, Google's PageRank, etc.)
Retrieve documents based on classification models (classification schemes, thesauri, Yahoo categories, etc.)
Fuse the results together before responding to the user

40
Models for Browsing