Title: Information Filtering IR Topics
1. Information Filtering IR Topics
- Tsvi Kuflik
- Department of Information Systems Engineering,
- Ben-Gurion University of the Negev
- Beer-Sheva 84105, Israel
- tsvikak_at_bgumail.bgu.ac.il
- Based partially on the books of Frakes & Baeza-Yates and Salton & McGill.
2IF Model
User with Long term Goals, tasks
Producers of Texts
Regular Information Interests
Distributers of Texts
Representation and organization
Representation
Profiles
Text Surrogates, Organized
Comparison or Interaction
Use and/or evaluation
modification
3. IR Topics
- IR provides the basics for IF
- Efficient document representation
- Efficient query representation
- Efficient matching methods
4IF Model
User with Long term Goals, tasks
Producers of Texts
Regular Information Interests
Distributers of Texts
Representation and organization
Representation
Profiles
Text Surrogates, Organized
Comparison or Interaction
Use and/or evaluation
modification
5. IR Topics
- Classical IR
- Text oriented (originally there was text)
- All manipulation is done on text
- SMART - Salton, 1983
6. IR Topics
- Information retrieval task
- Retrieve documents relevant to a user query from a known collection
- Expectation of a short-term goal
- Goal should be satisfied in real time
- Goal will not persist after being satisfied
- Mature, successful methods have been developed
- Most experience is with short text documents in static collections
7. IR Topics
- Steps In Typical IR System
- Document Preprocessing
- Document Indexing
- Query Processing
- Retrieval of Relevant Documents
- Presentation
- Relevance Feedback
8. IR Topics
- Preprocessing (Content based IF)
- Parsing/Lexical analysis
- Analyze text structural aspects
- Isolate textual segments
9. IR Topics
- Preprocessing (content-based IF)
- Using meaningful terms only - dimensionality reduction
- Stop lists
10. IR Topics
- Stop lists
- Removal of meaningless terms
- Inclusion of topic related terms only
- Issues
- Stop list content
- Domain related
- Phrases containing stop words
- More..
11. IR Topics
- Stop lists
- Frequent words in English (single letters, again, be, he, many, etc.)
- Frequent words in the database (such as "computer" in a computer science DB); the frequency threshold has to be defined
12. Stop List Example (85 out of 429)
- different, n, necessary, need, needed, needing, newest, next, no, nobody, non, noone, not, noting, now, nowhere, of, off, often, new, old, older, oldest, on, once, one, only, open, again, among, already, about, above, against, alone, after, also, although, along, always, an, across, b, and, an, other, ask, c, asking, asks, backed, away, a, should, show, came, all, almost, before, began, back, backing, be, became, because, becomes, been, at, behind, being, best, better, between, big, showed, ended, ending, both, but, by, asked, backs, can, cannot, number, numbers..
13. IR Topics
- A stop list can be implemented by
- Identifying and removing words from the lexical analyzer output
- Searching isolated words in a list (hash table)
- Filtering stop words as part of the lexical analysis process
- Finite state automata
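The hash-table lookup approach can be sketched in a few lines of Python; the tiny stop list below is illustrative only, not the 429-word list from the slides:

```python
# A tiny illustrative stop list; a real system would load the full list.
STOP_WORDS = {"a", "an", "and", "be", "by", "of", "on", "the"}

def remove_stop_words(tokens):
    """Drop tokens found in the stop list (hash-set lookup is O(1) per token)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words("The cat sat on a mat".split()))
```

A finite-state-automaton implementation would instead reject stop words character by character during tokenization, avoiding the separate lookup pass.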
15. IR Topics
- Preprocessing (content-based IF)
- Using meaningful terms only - dimensionality reduction
- Stemmers
16. IR Topics
- Stemmers
- Affix removal, mainly suffixes - Porter
- Table lookup
- Successor variety
- N-grams
- Issues
- Side effects
- Is the resulting stem a real word?
- Ambiguity
17. IR Topics
- Table lookup: a table of all index terms and their stems
18. IR Topics
- Disadvantages
- A stemmer table for English does not exist.
- What about other languages?
- Storage overhead.
- Advantage
- Easy to implement(?), efficient (search time)
- Could work well for static collections.
19. IR Topics
- Successor variety
- The successor variety of a string is the number of different characters that follow it in words in the same body of text
- Example
- Corpus: able, axle, accident, ape, about
- Successor variety for prefixes of "apple":
- For "a": 4 (followed by b, x, c, p)
- For "ap": 1 (followed only by e)
20. IR Topics
- Successor variety
- Implementation method examples
- Complete word: a break is made after a segment if that segment is a complete word in the corpus
- Peak and plateau: a segment break is made after a character whose successor variety exceeds both that of the character immediately preceding it and that of the character immediately following it
21. IR Topics
- Successor variety
- Example
- Test word: readable
- Corpus: able, ape, beatable, fixable, read, readable, reading, reads, red, rope, ripe
- Prefix / successor variety / letters:
- r: 3 (e, i, o)
- re: 2 (a, d)
- rea: 1 (d)
- read: 3 (a, i, s)
- reada: 1 (b)
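The prefix table above can be reproduced with a short sketch (the function name is mine):

```python
def successor_variety(prefix, corpus):
    """Number of distinct characters that follow `prefix` in corpus words."""
    return len({w[len(prefix)] for w in corpus
                if w.startswith(prefix) and len(w) > len(prefix)})

# The slide's corpus, lowercased.
CORPUS = ["able", "ape", "beatable", "fixable", "read", "readable",
          "reading", "reads", "red", "rope", "ripe"]

for p in ["r", "re", "rea", "read", "reada"]:
    print(p, successor_variety(p, CORPUS))
```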
22. IR Topics
- Successor variety
- Example
- By both methods, "readable" will be segmented into "read" and "able"
- Which segment will be selected?
- If the first segment appears in fewer than 12 words in the corpus, select it; otherwise select the second
- This is due to the observation that frequent segments may be prefixes
23. IR Topics
- N-grams: the shared digram method
- Terms are broken into n consecutive letters (n = 2: pairs of letters)
- Association measures are calculated between pairs of terms, based on shared unique digrams
- Example
- statistics: st ta at ti is st ti ic cs
- unique digrams: ta at cs ic is st ti
- statistical: st ta at ti is st ti ic ca al
- unique digrams: al at ca ic is st ta ti
- 6 shared digrams: at ic is st ta ti
24. IR Topics
- Similarity measure: S = 2C / (A + B)
- A is the number of unique digrams in the first word
- B is the number of unique digrams in the second word
- C is the number of shared digrams
- Similar words are grouped together, represented by the shared digrams
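The shared-digram measure (Dice's coefficient over digram sets) is easy to sketch; for the slide's pair it reproduces S = 2·6 / (7 + 8) = 0.8:

```python
def digrams(word):
    """Set of unique consecutive letter pairs (n-grams with n = 2)."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def dice_similarity(w1, w2):
    """S = 2C / (A + B), where C is the count of shared unique digrams."""
    a, b = digrams(w1), digrams(w2)
    return 2 * len(a & b) / (len(a) + len(b))

print(dice_similarity("statistics", "statistical"))
```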
25. IR Topics
- Algorithms that remove suffixes and/or prefixes from terms, leaving a stem
- Example rules
- if a word ends in "ies" but not "eies" or "aies", then ies → y (studies → study)
- if a word ends in "es" but not "aes", "ees" or "oes", then es → e (tables → table; referees is not changed to refere)
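The two example rules translate directly into code; this implements only the slide's two rules, not a full stemmer:

```python
def apply_suffix_rules(word):
    """Apply the two example rules; other words pass through unchanged."""
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"      # ies -> y
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-1]            # es -> e (drop the final s)
    return word

print(apply_suffix_rules("studies"),
      apply_suffix_rules("tables"),
      apply_suffix_rules("referees"))
```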
26. IR Topics
- Longest match vs. simple rules
- Longest match removes the longest possible string of characters
- Porter's algorithm uses a suffix list for suffix stripping
27. IR Topics
- Overstemming, e.g. readable → red
- Understemming, e.g. users → use
- Accuracy, e.g. skies → sky, not ski; a special rule for "k" in plurals is needed
28. IR Topics
- Stemming summary
- May have a positive effect on retrieval performance
- Will usually not degrade performance
- May reduce the size of document representations and indices
- Increases recall, possibly at the cost of decreased precision
29. IR Topics
- Preprocessing (content-based IF)
- Using meaningful terms only - dimensionality reduction
- Dictionaries (thesaurus, ontology)
30. IR Topics
- Dictionary/Thesaurus/Ontology
- Topic related terms
- Linguistic correctness of results
- Issues
- Ambiguity
- Context related
31. IR Topics
- Thesauri
- Term relationships
- Equivalence
- Hierarchical
- Non hierarchical
- Specificity of Vocabulary
- Manual/Automatic construction
- Based on collections of documents
- Merging existing Thesauri
32. IR Topics
- IR classical models
- Boolean
- Vector space
- Probabilistic
- Model should provide
- Document and queries representation
- Matching techniques / similarity measure
33. IR Topics
- Document representation - Boolean
- Boolean model
- Based on mutual occurrence of terms in documents and queries
- If sim(dj, q) = 1 then the Boolean model predicts that document dj is relevant to query q (it might not be); otherwise the prediction is that the document is not relevant
34. IR Topics
- Document Representation - Boolean
- Boolean Model
- Clean formal definition.
- Boolean operators AND, OR, NOT
- Simple implementation
- Intuitive
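With an inverted index mapping each term to the set of documents containing it, the Boolean operators reduce to set operations; the index below is made up for illustration:

```python
# Hypothetical inverted index: term -> set of document ids.
index = {"database": {1, 2}, "text": {2, 3}, "information": {1, 3}}
all_docs = {1, 2, 3}

and_docs = index["database"] & index["text"]   # AND: set intersection
or_docs  = index["database"] | index["text"]   # OR: set union
not_docs = all_docs - index["text"]            # NOT: set complement

print(and_docs, or_docs, not_docs)
```

This simplicity is exactly the model's appeal; its rigidity (slide 35) follows from the same set semantics.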
35. IR Topics
- Document representation - Boolean
- Boolean model
- Very rigid: AND means all; OR means any
- Difficult to express complex user requests
- Difficult to control the number of documents retrieved
- All matched documents will be returned
- Difficult to rank output
- All matched documents logically satisfy the query
36. IR Topics
- Document Representation - Statistical
- A document is typically represented by a bag of
words (unordered words with frequencies).
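A bag of words (unordered words with frequencies) is exactly what Python's `collections.Counter` provides:

```python
from collections import Counter

text = "to be or not to be"
bag = Counter(text.split())   # word -> frequency, order ignored
print(bag)
```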
37IR Topics
- Bag set that allows multiple occurrences of the
same element. - User specifies a set of desired terms with
optional weights - Weighted query terms
- Q lt database 0.5 text 0.8 information
0.2 gt - Unweighted query terms
- Q lt database text information gt
- No Boolean conditions specified in the query.
38. IR Topics
- Document representation - statistical
- Retrieval based on similarity between query and documents
- Output documents are ranked according to similarity to the query
- Similarity is based on occurrence frequencies of keywords in query and document
39. IR Topics
- Document Representation
- Vector Space Model
- Vector of terms (weighted or not)
- Linear Algebra
- Vector similarity implies document similarity
- Issues
- Document size
- Vector size
- Independence
- Multimedia
40. IR Topics
- Document Representation
- Vector Space Model Issues
- Document size
- Vector size
- Independence
- Multimedia
41. IR Topics
- Document representation - statistical
- Vector space model
- How to determine important words in a document?
- Word sense?
- How to determine the degree of importance of a term within a document and within the entire collection?
- How to determine the degree of similarity between a document and the query?
42. IR Topics
- Term independence: a false assumption
- Are terms independent?
- LSI
43. IR Topics
- Document representation (cont.)
- Boolean
- Term present/absent
- TF
- Term frequency: relevancy/importance of the term within the document
- DF
- Term usage across documents: discrimination
- TF-IDF
- Combination of the above
44. IR Topics
- TF-IDF
- TF normalization
- by document length
- by maximum term frequency
- IDF
- calculation: idf(t) = log(N / df(t))
45. IR Topics
- TF-IDF example
- Given a document containing terms with frequencies: A(3), B(2), C(1)
- Assume the collection contains 10,000 documents and the document frequencies of these terms are: A(50), B(1300), C(250)
- Then:
- A: tf = 3/3; idf = log(10000/50) = 5.3; tf-idf = 5.3
- B: tf = 2/3; idf = log(10000/1300) = 2.0; tf-idf = 1.3
- C: tf = 1/3; idf = log(10000/250) = 3.7; tf-idf = 1.2
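The slide's numbers can be checked directly; its idf values match the natural logarithm, and B's tf-idf of 1.3 comes from rounding idf to 2.0 before multiplying by 2/3:

```python
import math

N = 10_000                   # collection size
max_tf = 3                   # frequency of the most frequent term (A)
terms = {"A": (3, 50), "B": (2, 1300), "C": (1, 250)}   # term -> (tf, df)

for name, (tf, df) in terms.items():
    idf = math.log(N / df)   # natural log reproduces the slide's idf values
    print(name, round(tf / max_tf, 2), round(idf, 1),
          round(tf / max_tf * idf, 2))
```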
46. IR Topics
- Similarity between the vectors for document dj and query q can be computed as the vector inner product:
- sim(dj, q) = dj · q = Σi wij · wiq
- where wij is the weight of term i in document j and wiq is the weight of term i in the query
- For binary vectors, the inner product is the number of matched query terms in the document (size of the intersection)
- For weighted term vectors, it is the sum of the products of the weights of the matched terms
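Over sparse term-weight dictionaries the inner product is one line; for binary (0/1) weights it simply counts the matched terms. The vectors here are invented for illustration:

```python
def inner_product(doc, query):
    """Sum of weight products over terms shared by doc and query."""
    return sum(w * query[t] for t, w in doc.items() if t in query)

# Weighted vectors (term -> weight).
doc   = {"database": 0.4, "text": 0.2, "retrieval": 0.7}
query = {"database": 0.5, "text": 0.8}
print(inner_product(doc, query))      # 0.4*0.5 + 0.2*0.8

# Binary vectors: the result is the size of the term intersection.
bdoc   = {"database": 1, "text": 1, "retrieval": 1}
bquery = {"database": 1, "information": 1}
print(inner_product(bdoc, bquery))    # 1 matched term
```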
47. IR Topics
- Similarity between the vectors for document dj and query q can be computed as cosine similarity, the cosine of the angle between the two vectors:
- CosSim(dj, q) = (dj · q) / (|dj| |q|) = Σi wij·wiq / (√(Σi wij²) · √(Σi wiq²))
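A minimal sketch of the cosine measure over sparse term-weight dictionaries:

```python
import math

def cosine_similarity(d, q):
    """cos(d, q) = (d . q) / (|d| |q|) over term-weight dictionaries."""
    dot = sum(w * q[t] for t, w in d.items() if t in q)
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_d * norm_q)

print(cosine_similarity({"a": 1.0, "b": 1.0}, {"a": 1.0}))
```

Unlike the raw inner product, cosine normalizes by vector length, so long documents are not favored simply for containing more terms.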
48. IR Topics
- Similarity between the vectors for document dj and query q can also be computed as the Euclidean distance between the vectors
- Many more techniques exist
49. IR Topics
- Document representation
- Probabilistic
- Binary Independence Retrieval (BIR) model
- T = {t1, …, tn}: the set of terms in the collection
- qk: a query
- dm: a document
- Assigns weights to query terms appearing in a document
- Retrieval function: ρBIR(qk, dm)
50. IR Topics
- Document representation
- Probabilistic (cont.)
- Document D is composed of a set of index terms ti
- We will use them to represent the document
- Index term weights are all binary
- Index terms can appear in relevant documents as well as in non-relevant documents, so we have two probabilities for every term
51. IR Topics
- Document representation
- Probabilistic
- A term's weight is based on its frequency in relevant vs. non-relevant documents in the corpus, so each term has two values
- If these two values are known, then the probability that a new document is relevant can be calculated from them
52. IR Topics
- Document representation
- Probabilistic (cont.)
- Determines the probability that a document is relevant to a specific query
- How do we determine whether a given document Dj is relevant to query Qi?
- Use Bayes' theorem: P(R|D) = P(D|R) · P(R) / P(D)
- Considering odds: O(R|D) = P(R|D) / P(NR|D)
53. IR Topics
- Document representation
- Probabilistic (cont.)
- Document D is composed of a set of index terms ti, which we use to represent the document
- Index term weights are all binary
- The odds that a document is relevant: O(R|D) = P(R|D) / P(NR|D) = [P(D|R) / P(D|NR)] · [P(R) / P(NR)]
54. IR Topics
- Document representation
- Probabilistic (cont.)
- Split according to presence/absence of index terms: under term independence, P(D|R) = Π(ti in D) pi · Π(ti not in D) (1 − pi), and similarly P(D|NR) with qi
55. IR Topics
- Document representation
- Probabilistic (cont.)
- pi: probability that ti occurs in an arbitrary relevant document
- qi: probability that ti occurs in an arbitrary non-relevant document
56. IR Topics
- Document representation
- Probabilistic (cont.)
- Assume that pi = qi for terms that do not occur in the query qk
- Then only the first product varies across documents with respect to qk
57. IR Topics
- Document representation
- Probabilistic (cont.)
- Use logarithms
- Retrieval function:
- ρBIR(qk, dm) = Σ(ti in qk ∩ dm) log[ pi(1 − qi) / (qi(1 − pi)) ]
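Given estimates of pi and qi (a term's occurrence probability in relevant vs. non-relevant documents), the retrieval status value sums log odds-ratio weights over the query terms present in the document. The probability estimates below are invented for illustration:

```python
import math

def bir_weight(p_i, q_i):
    """BIR term weight: log[ p(1-q) / (q(1-p)) ]."""
    return math.log(p_i * (1 - q_i) / (q_i * (1 - p_i)))

def rsv(doc_terms, query_terms, p, q):
    """Retrieval status value: sum weights of query terms in the document."""
    return sum(bir_weight(p[t], q[t]) for t in query_terms & doc_terms)

# Hypothetical estimates for two query terms.
p = {"database": 0.8, "text": 0.5}
q = {"database": 0.3, "text": 0.5}
score = rsv({"database", "text", "index"}, {"database", "text"}, p, q)
print(score)
```

Note that a term with pi = qi contributes weight 0, which is why such terms can be dropped from the ranking (slide 56).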
58. IR Topics
- Query representation
- Boolean
- Statistical
- TFIDF (actually IDF alone)
- Stemming (optional)
- Expansion (optional)
- Issues
- Small number of terms
- Exact meaning (context, expansion)
59. IR Topics
- Similarity
- Distance
- Cosine
- Euclidean distance
- Dot product
- ...
- Probabilistic measures
- Thresholds
60. IR Topics
- Presentation
- Results ordering
- by similarity
- in decreasing similarity order
- Presentation
- Top n
- User-requested number
- First n
61. IR Topics
- Relevance
- Relevance is a subjective judgment and may include:
- Being on the proper subject
- Being timely (recent information)
- Being authoritative (from a trusted source)
- Satisfying the goals of the user and his/her intended use of the information (information need)
62. IR Topics
- Relevance
- Subjective
- Measurable
- Ambiguous
- Helpful
63. IR Topics
- Discussion
- Boolean model: weakest, no partial match
- Vector space model is very popular, easy to implement, and expected to be better
- Probabilistic model has a better theoretical background (?), considered better than the previous two (?)
- The independence assumption is wrong, both in the probabilistic model and in the vector space model
64. IR Topics
- Content-based IF (IR based)
- User information needs are defined by a set of (possibly weighted) terms (vector space/probabilistic)
- Data items (e.g. documents) are represented in a similar way
- User needs and data-item representations (vectors) are matched/correlated
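Putting the pieces together, a content-based filter can score incoming documents against a long-term user profile vector; the profile, documents, and weights below are all invented for illustration:

```python
def score(profile, doc):
    """Inner product of the profile and document weight vectors."""
    return sum(w * doc[t] for t, w in profile.items() if t in doc)

# Hypothetical long-term user profile (term -> interest weight).
profile = {"information": 0.8, "filtering": 0.9, "sports": 0.1}

# Hypothetical incoming documents.
documents = {
    "d1": {"information": 0.5, "filtering": 0.7},
    "d2": {"sports": 0.9, "news": 0.4},
}

ranked = sorted(documents, key=lambda d: score(profile, documents[d]),
                reverse=True)
print(ranked)
```

In an IF setting this scoring runs continuously over a document stream, and the profile is modified over time using the feedback loop of the IF model.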