Title: Information Filtering IR Topics
1. Information Filtering IR Topics
- Tsvi Kuflik
- Department of Information Systems Engineering,
- Ben-Gurion University of the Negev
- Beer-Sheva 84105, Israel
- tsvikak_at_bgumail.bgu.ac.il
- Based partially on the books of Frakes & Baeza-Yates and Salton & McGill.
2IF Model
User with Long term Goals, tasks
Producers of Texts
Regular Information Interests
Distributers of Texts
Representation and organization
Representation
Profiles
Text Surrogates, Organized
Comparison or Interaction
Use and/or evaluation
modification
3. IR Topics
- IR provides the basics for IF
- Efficient document representation
- Efficient query representation
- Efficient matching methods
4IF Model
User with Long term Goals, tasks
Producers of Texts
Regular Information Interests
Distributers of Texts
Representation and organization
Representation
Profiles
Text Surrogates, Organized
Comparison or Interaction
Use and/or evaluation
modification
5. IR Topics
- Classical IR
- Text oriented (originally there was text)
- All manipulation is done on text
- SMART - Salton, 1983
6. IR Topics
- Information retrieval task
- Retrieve documents relevant to a user query from a known collection
- Expectation of a short-term goal
- Goal should be satisfied in real time
- Goal will not persist after being satisfied
- Mature, successful methods have been developed
- Most experience is with short text documents in static collections
7. IR Topics
- Steps In Typical IR System
- Document Preprocessing
- Document Indexing
- Query Processing
- Retrieval of Relevant Documents
- Presentation
- Relevance Feedback
8. IR Topics
- Preprocessing (Content based IF)
- Parsing/Lexical analysis
- Analyze text structural aspects
- Isolate textual segments
9. IR Topics
- Preprocessing (content-based IF)
- Using meaningful terms only - dimensionality reduction
- Stop lists
10. IR Topics
- Stop lists
- Removal of meaningless terms
- Inclusion of topic related terms only
- Issues
- Stop list content
- Domain related
- Phrases containing stop words
- More..
11. IR Topics
- Stop lists
- Frequent words in English (single letters, again, be, he, many, etc.)
- Frequent words in the database (such as "computer" in a computer science DB); the frequency threshold has to be defined
12. Stop List Example (85 out of 429)
- different, n, necessary, need, needed, needing, newest, next, no, nobody, non, noone, not, noting, now, nowhere, of, off, often, new, old, older, oldest, on, once, one, only, open, again, among, already, about, above, against, alone, after, also, although, along, always, an, across, b, and, an, other, ask, c, asking, asks, backed, away, a, should, show, came, all, almost, before, began, back, backing, be, became, because, becomes, been, at, behind, being, best, better, between, big, showed, ended, ending, both, but, by, asked, backs, can, cannot, number, numbers..
13. IR Topics
- A stop list can be implemented by
- Identifying and removing words from the lexical analyzer output
- Searching isolated words in a list (hash table)
- Filtering stop words as part of the lexical analysis process
- Finite state automata
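The hash-table lookup approach can be sketched in a few lines of Python; the tiny stop list below is illustrative only, not the 429-word list from the slides:

```python
# A tiny illustrative stop list; a real system would load the full list.
STOP_WORDS = {"a", "an", "and", "be", "by", "of", "on", "the"}

def remove_stop_words(tokens):
    """Drop tokens found in the stop list (hash-set lookup is O(1) per token)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words("The cat sat on a mat".split()))
```

A finite-state-automaton implementation would instead reject stop words character by character during tokenization, avoiding the separate lookup pass.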
15. IR Topics
- Preprocessing (content-based IF)
- Using meaningful terms only - dimensionality reduction
- Stemmers
16. IR Topics
- Stemmers
- Affix removal, mainly suffixes - Porter
- Table lookup
- Successor variety
- N-grams
- Issues
- Side effects
- Is the resulting stem a real word?
- Ambiguity
17. IR Topics
- Table lookup: a table of all index terms and their stems
18. IR Topics
- Disadvantages
- A stemmer table for English does not exist.
- What about other languages?
- Storage overhead.
- Advantage
- Easy to implement(?), efficient (search time)
- Could work well for static collections.
19. IR Topics
- Successor variety
- The successor variety of a string is the number of different characters that follow it in words in the same body of text
- Example
- Corpus: able, axle, accident, ape, about
- Successor variety for prefixes of "apple":
- For "a": 4 (followed by b, x, c, p)
- For "ap": 1 (followed only by e)
20. IR Topics
- Successor variety
- Implementation method examples
- Complete word: a break is made after a segment if that segment is a complete word in the corpus
- Peak and plateau: a segment break is made after a character whose successor variety exceeds both that of the character immediately preceding it and that of the character immediately following it
21. IR Topics
- Successor variety
- Example
- Test word: readable
- Corpus: able, ape, beatable, fixable, read, readable, reading, reads, red, rope, ripe
- Prefix / successor variety / letters:
- r: 3 (e, i, o)
- re: 2 (a, d)
- rea: 1 (d)
- read: 3 (a, i, s)
- reada: 1 (b)
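The prefix table above can be reproduced with a short sketch (the function name is mine):

```python
def successor_variety(prefix, corpus):
    """Number of distinct characters that follow `prefix` in corpus words."""
    return len({w[len(prefix)] for w in corpus
                if w.startswith(prefix) and len(w) > len(prefix)})

# The slide's corpus, lowercased.
CORPUS = ["able", "ape", "beatable", "fixable", "read", "readable",
          "reading", "reads", "red", "rope", "ripe"]

for p in ["r", "re", "rea", "read", "reada"]:
    print(p, successor_variety(p, CORPUS))
```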
22. IR Topics
- Successor variety
- Example
- By both methods, "readable" will be segmented into "read" and "able"
- Which segment will be selected?
- If the first segment appears in fewer than 12 words in the corpus, select it; otherwise select the second
- This is due to the observation that frequent segments may be prefixes
23. IR Topics
- N-grams: the shared digram method
- Terms are broken into n consecutive letters (n = 2: pairs of letters)
- Association measures are calculated between pairs of terms, based on shared unique digrams
- Example
- statistics: st ta at ti is st ti ic cs
- unique digrams: ta at cs ic is st ti
- statistical: st ta at ti is st ti ic ca al
- unique digrams: al at ca ic is st ta ti
- 6 shared digrams: at ic is st ta ti
24. IR Topics
- Similarity measure: S = 2C / (A + B)
- A is the number of unique digrams in the first word
- B is the number of unique digrams in the second word
- C is the number of shared digrams
- Similar words are grouped together, represented by the shared digrams
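The shared-digram measure (Dice's coefficient over digram sets) is easy to sketch; for the slide's pair it reproduces S = 2·6 / (7 + 8) = 0.8:

```python
def digrams(word):
    """Set of unique consecutive letter pairs (n-grams with n = 2)."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def dice_similarity(w1, w2):
    """S = 2C / (A + B), where C is the count of shared unique digrams."""
    a, b = digrams(w1), digrams(w2)
    return 2 * len(a & b) / (len(a) + len(b))

print(dice_similarity("statistics", "statistical"))
```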
25. IR Topics
- Algorithms that remove suffixes and/or prefixes from terms, leaving a stem
- Example rules
- if a word ends in "ies" but not "eies" or "aies", then ies → y (studies → study)
- if a word ends in "es" but not "aes", "ees" or "oes", then es → e (tables → table; referees is not changed to refere)
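The two example rules translate directly into code; this implements only the slide's two rules, not a full stemmer:

```python
def apply_suffix_rules(word):
    """Apply the two example rules; other words pass through unchanged."""
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"      # ies -> y
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-1]            # es -> e (drop the final s)
    return word

print(apply_suffix_rules("studies"),
      apply_suffix_rules("tables"),
      apply_suffix_rules("referees"))
```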
26. IR Topics
- Longest match vs. simple rules
- Longest match removes the longest possible string of characters
- Porter's algorithm uses a suffix list for suffix stripping
27. IR Topics
- Overstemming, e.g. readable → red
- Understemming, e.g. users → use
- Accuracy, e.g. skies → sky, not ski; a special rule for "k" in plurals is needed
28. IR Topics
- Stemming summary
- May have a positive effect on retrieval performance
- Will usually not degrade performance
- May reduce the size of document representations and indices
- Increases recall, possibly at the cost of decreased precision
29. IR Topics
- Preprocessing (content-based IF)
- Using meaningful terms only - dimensionality reduction
- Dictionaries (thesaurus, ontology)
30. IR Topics
- Dictionary/Thesaurus/Ontology
- Topic related terms
- Linguistic correctness of results
- Issues
- Ambiguity
- Context related
31. IR Topics
- Thesauri
- Term relationships
- Equivalence
- Hierarchical
- Non hierarchical
- Specificity of Vocabulary
- Manual/Automatic construction
- Based on collections of documents
- Merging existing Thesauri
32. IR Topics
- IR classical models
- Boolean
- Vector space
- Probabilistic
- Model should provide
- Document and queries representation
- Matching techniques / similarity measure
33. IR Topics
- Document representation - Boolean
- Boolean model
- Based on mutual occurrence of terms in documents and queries
- If sim(dj, q) = 1 then the Boolean model predicts that document dj is relevant to query q (it might not be); otherwise the prediction is that the document is not relevant
34. IR Topics
- Document Representation - Boolean
- Boolean Model
- Clean formal definition.
- Boolean operators AND, OR, NOT
- Simple implementation
- Intuitive
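With an inverted index mapping each term to the set of documents containing it, the Boolean operators reduce to set operations; the index below is made up for illustration:

```python
# Hypothetical inverted index: term -> set of document ids.
index = {"database": {1, 2}, "text": {2, 3}, "information": {1, 3}}
all_docs = {1, 2, 3}

and_docs = index["database"] & index["text"]   # AND: set intersection
or_docs  = index["database"] | index["text"]   # OR: set union
not_docs = all_docs - index["text"]            # NOT: set complement

print(and_docs, or_docs, not_docs)
```

This simplicity is exactly the model's appeal; its rigidity (slide 35) follows from the same set semantics.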
35. IR Topics
- Document representation - Boolean
- Boolean model
- Very rigid: AND means all; OR means any
- Difficult to express complex user requests
- Difficult to control the number of documents retrieved
- All matched documents will be returned
- Difficult to rank output
- All matched documents logically satisfy the query
36. IR Topics
- Document Representation - Statistical
- A document is typically represented by a bag of
words (unordered words with frequencies).
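A bag of words (unordered words with frequencies) is exactly what Python's `collections.Counter` provides:

```python
from collections import Counter

text = "to be or not to be"
bag = Counter(text.split())   # word -> frequency, order ignored
print(bag)
```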
37IR Topics
- Bag set that allows multiple occurrences of the
same element. - User specifies a set of desired terms with
optional weights - Weighted query terms
- Q lt database 0.5 text 0.8 information
0.2 gt - Unweighted query terms
- Q lt database text information gt
- No Boolean conditions specified in the query.
38. IR Topics
- Document representation - statistical
- Retrieval based on similarity between query and documents
- Output documents are ranked according to similarity to the query
- Similarity is based on occurrence frequencies of keywords in query and document
39. IR Topics
- Document Representation
- Vector Space Model
- Vector of terms (weighted or not)
- Linear Algebra
- Vector similarity implies document similarity
- Issues
- Document size
- Vector size
- Independence
- Multimedia
40. IR Topics
- Document Representation
- Vector Space Model Issues
- Document size
- Vector size
- Independence
- Multimedia
41. IR Topics
- Document representation - statistical
- Vector space model
- How to determine important words in a document?
- Word sense?
- How to determine the degree of importance of a term within a document and within the entire collection?
- How to determine the degree of similarity between a document and the query?
42. IR Topics
- Term independence: a false assumption
- Are terms independent?
- LSI
43. IR Topics
- Document representation (cont.)
- Boolean
- Term present/absent
- TF
- Term frequency: relevancy/importance of the term within the document
- DF
- Term usage across documents: discrimination
- TF-IDF
- Combination of the above
44. IR Topics
- TF-IDF
- TF normalization
- by document length
- by maximum term frequency
- IDF
- calculation: idf(t) = log(N / df(t))
45. IR Topics
- TF-IDF example
- Given a document containing terms with frequencies: A(3), B(2), C(1)
- Assume the collection contains 10,000 documents and the document frequencies of these terms are: A(50), B(1300), C(250)
- Then:
- A: tf = 3/3; idf = log(10000/50) = 5.3; tf-idf = 5.3
- B: tf = 2/3; idf = log(10000/1300) = 2.0; tf-idf = 1.3
- C: tf = 1/3; idf = log(10000/250) = 3.7; tf-idf = 1.2
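The slide's numbers can be checked directly; its idf values match the natural logarithm, and B's tf-idf of 1.3 comes from rounding idf to 2.0 before multiplying by 2/3:

```python
import math

N = 10_000                   # collection size
max_tf = 3                   # frequency of the most frequent term (A)
terms = {"A": (3, 50), "B": (2, 1300), "C": (1, 250)}   # term -> (tf, df)

for name, (tf, df) in terms.items():
    idf = math.log(N / df)   # natural log reproduces the slide's idf values
    print(name, round(tf / max_tf, 2), round(idf, 1),
          round(tf / max_tf * idf, 2))
```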
46. IR Topics
- Similarity between the vectors for document dj and query q can be computed as the vector inner product:
- sim(dj, q) = dj · q = Σi wij · wiq
- where wij is the weight of term i in document j and wiq is the weight of term i in the query
- For binary vectors, the inner product is the number of matched query terms in the document (size of the intersection)
- For weighted term vectors, it is the sum of the products of the weights of the matched terms
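Over sparse term-weight dictionaries the inner product is one line; for binary (0/1) weights it simply counts the matched terms. The vectors here are invented for illustration:

```python
def inner_product(doc, query):
    """Sum of weight products over terms shared by doc and query."""
    return sum(w * query[t] for t, w in doc.items() if t in query)

# Weighted vectors (term -> weight).
doc   = {"database": 0.4, "text": 0.2, "retrieval": 0.7}
query = {"database": 0.5, "text": 0.8}
print(inner_product(doc, query))      # 0.4*0.5 + 0.2*0.8

# Binary vectors: the result is the size of the term intersection.
bdoc   = {"database": 1, "text": 1, "retrieval": 1}
bquery = {"database": 1, "information": 1}
print(inner_product(bdoc, bquery))    # 1 matched term
```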
47. IR Topics
- Similarity between the vectors for document dj and query q can be computed as cosine similarity, the cosine of the angle between the two vectors:
- CosSim(dj, q) = (dj · q) / (|dj| |q|) = Σi wij·wiq / (√(Σi wij²) · √(Σi wiq²))
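A minimal sketch of the cosine measure over sparse term-weight dictionaries:

```python
import math

def cosine_similarity(d, q):
    """cos(d, q) = (d . q) / (|d| |q|) over term-weight dictionaries."""
    dot = sum(w * q[t] for t, w in d.items() if t in q)
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_d * norm_q)

print(cosine_similarity({"a": 1.0, "b": 1.0}, {"a": 1.0}))
```

Unlike the raw inner product, cosine normalizes by vector length, so long documents are not favored simply for containing more terms.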
48. IR Topics
- Similarity between the vectors for document dj and query q can also be computed as the Euclidean distance between the vectors
- Many more techniques exist
49. IR Topics
- Document representation
- Probabilistic
- Binary Independence Retrieval (BIR) model
- T = {t1, …, tn}: the set of terms in the collection
- qk: a query
- dm: a document
- Assigns weights to query terms appearing in a document
- Retrieval function: ρBIR(qk, dm)
50. IR Topics
- Document representation
- Probabilistic (cont.)
- Document D is composed of a set of index terms ti
- We will use them to represent the document
- Index term weights are all binary
- Index terms can appear in relevant documents as well as in non-relevant documents, so we have two probabilities for every term
51. IR Topics
- Document representation
- Probabilistic
- A term's weight is based on its frequency in relevant vs. non-relevant documents in the corpus, so each term has two values
- If these two values are known, then the probability that a new document is relevant can be calculated from them
52. IR Topics
- Document representation
- Probabilistic (cont.)
- Determines the probability that a document is relevant to a specific query
- How do we determine whether a given document Dj is relevant to query Qi?
- Use Bayes' theorem: P(R|D) = P(D|R) · P(R) / P(D)
- Considering odds: O(R|D) = P(R|D) / P(NR|D)
53. IR Topics
- Document representation
- Probabilistic (cont.)
- Document D is composed of a set of index terms ti, which we use to represent the document
- Index term weights are all binary
- The odds that a document is relevant: O(R|D) = P(R|D) / P(NR|D) = [P(D|R) / P(D|NR)] · [P(R) / P(NR)]
54. IR Topics
- Document representation
- Probabilistic (cont.)
- Split according to presence/absence of index terms: under term independence, P(D|R) = Π(ti in D) pi · Π(ti not in D) (1 − pi), and similarly P(D|NR) with qi
55. IR Topics
- Document representation
- Probabilistic (cont.)
- pi: probability that ti occurs in an arbitrary relevant document
- qi: probability that ti occurs in an arbitrary non-relevant document
56. IR Topics
- Document representation
- Probabilistic (cont.)
- Assume that pi = qi for terms that do not occur in the query qk
- Then only the first product varies across documents with respect to qk
57. IR Topics
- Document representation
- Probabilistic (cont.)
- Use logarithms
- Retrieval function:
- ρBIR(qk, dm) = Σ(ti in qk ∩ dm) log[ pi(1 − qi) / (qi(1 − pi)) ]
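Given estimates of pi and qi (a term's occurrence probability in relevant vs. non-relevant documents), the retrieval status value sums log odds-ratio weights over the query terms present in the document. The probability estimates below are invented for illustration:

```python
import math

def bir_weight(p_i, q_i):
    """BIR term weight: log[ p(1-q) / (q(1-p)) ]."""
    return math.log(p_i * (1 - q_i) / (q_i * (1 - p_i)))

def rsv(doc_terms, query_terms, p, q):
    """Retrieval status value: sum weights of query terms in the document."""
    return sum(bir_weight(p[t], q[t]) for t in query_terms & doc_terms)

# Hypothetical estimates for two query terms.
p = {"database": 0.8, "text": 0.5}
q = {"database": 0.3, "text": 0.5}
score = rsv({"database", "text", "index"}, {"database", "text"}, p, q)
print(score)
```

Note that a term with pi = qi contributes weight 0, which is why such terms can be dropped from the ranking (slide 56).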
58. IR Topics
- Query representation
- Boolean
- Statistical
- TFIDF (actually IDF alone)
- Stemming (optional)
- Expansion (optional)
- Issues
- Small number of terms
- Exact meaning (context, expansion)
59. IR Topics
- Similarity
- Distance
- Cosine
- Euclidean distance
- Dot product
- ...
- Probabilistic measures
- Thresholds
60. IR Topics
- Presentation
- Results ordering
- by similarity
- in decreasing similarity order
- Presentation
- Top n
- User-requested number
- First n
61. IR Topics
- Relevance
- Relevance is a subjective judgment and may include:
- Being on the proper subject
- Being timely (recent information)
- Being authoritative (from a trusted source)
- Satisfying the goals of the user and his/her intended use of the information (information need)
62. IR Topics
- Relevance
- Subjective
- Measurable
- Ambiguous
- Helpful
63. IR Topics
- Discussion
- Boolean model: weakest, no partial match
- Vector space model is very popular, easy to implement, and expected to be better
- Probabilistic model has a better theoretical background (?), considered better than the previous two (?)
- The independence assumption is wrong, both in the probabilistic model and in the vector space model
64. IR Topics
- Content-based IF (IR based)
- User information needs are defined by a set of (possibly weighted) terms (vector space/probabilistic)
- Data items (e.g. documents) are represented in a similar way
- User needs and data-item representations (vectors) are matched/correlated
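Putting the pieces together, a content-based filter can score incoming documents against a long-term user profile vector; the profile, documents, and weights below are all invented for illustration:

```python
def score(profile, doc):
    """Inner product of the profile and document weight vectors."""
    return sum(w * doc[t] for t, w in profile.items() if t in doc)

# Hypothetical long-term user profile (term -> interest weight).
profile = {"information": 0.8, "filtering": 0.9, "sports": 0.1}

# Hypothetical incoming documents.
documents = {
    "d1": {"information": 0.5, "filtering": 0.7},
    "d2": {"sports": 0.9, "news": 0.4},
}

ranked = sorted(documents, key=lambda d: score(profile, documents[d]),
                reverse=True)
print(ranked)
```

In an IF setting this scoring runs continuously over a document stream, and the profile is modified over time using the feedback loop of the IF model.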