Text-retrieval Systems - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Text-retrieval Systems

1
Text-retrieval Systems
  • NDBI010 Lecture Slides
  • KSI MFF UK
  • http://www.ms.mff.cuni.cz/kopecky/teaching/ndbi010/
  • Version 10.05.12.13.30.en

2
Literature (textbooks)
  • Introduction to Information Retrieval
  • Christopher D. Manning, Prabhakar Raghavan and
    Hinrich Schütze
  • Cambridge University Press, 2008
  • http://informationretrieval.org/
  • Dokumentografické informační systémy
  • Pokorný J., Snášel V., Kopecký M.
  • Nakladatelství Karolinum, UK Praha, 2005
  • Pokorný J., Snášel V., Húsek D.
  • Nakladatelství Karolinum, UK Praha, 1998
  • Textové informační systémy
  • Melichar B.
  • Vydavatelství ČVUT, Praha, 1997

3
Further links (books)
  • Computer Algorithms - String Pattern Matching
    Strategies,
  • Jun Ichi Aoe,
  • IEEE Computer Society Press 1994
  • Concept Decomposition for Large Sparse Text Data
    using Clustering
  • Inderjit S. Dhillon, Dharmendra S. Modha
  • IBM Almaden Research Center, 1999

4
Further links (articles)
  • The IGrid Index: Reversing the Dimensionality
    Curse for Similarity Indexing in High Dimensional
    Space
  • Charu C. Aggarwal, Philip S. Yu
  • IBM T. J. Watson Research Center
  • The Pyramid Technique: Towards Breaking the Curse
    of Dimensionality
  • S. Berchtold, C. Böhm, H.-P. Kriegel
  • ACM SIGMOD Conference Proceedings, 1998

5
Further links (articles)
  • Affinity Rank: A New Scheme for Efficient Web
    Search
  • Yi Liu, Benyu Zhang, Zheng Chen, Michael R. Lyu,
    Wei-Ying Ma
  • 2004
  • Improving Web Search Results Using Affinity Graph
  • Benyu Zhang, Hua Li, Yi Liu, Lei Ji, Wensi Xi,
    Weiguo Fan, Zheng Chen, Wei-Ying Ma
  • Efficient Computation of PageRank
  • T. H. Haveliwala
  • Technical report, Stanford University, 1999

6
Further links (older)
  • Introduction to Modern Information Retrieval
  • Salton G., McGill M. J.
  • McGraw-Hill, 1981
  • Výběr informací v textových bázích dat
  • Pokorný J.
  • OVC ČVUT, Praha, 1989

7
Introduction
  • Overview of the problem: measuring
    informativeness

8
Retrieval system origin
  • 1950s
  • The gradual automation of the procedures used in
    libraries
  • Now a separate subsection of information systems (ISs)
  • Factual IS
  • Processing of information having a defined internal
    structure (usually in the form of tables)
  • Bibliographic IS
  • Processing of information in the form of text
    written in natural language without a strict
    internal structure.

9
Interaction with TRS
  1. Query formulation
  2. Comparison
  3. Hit-list obtaining
  4. Query tuning/reformulation
  5. Document request
  6. Document obtaining

10
TRS Structure
  • Document disclosure system
  • Returns secondary information
  • Author
  • Title
  • ...
  • Document delivery system
  • Need not be supported by the software

(Figure: interaction steps 1-4 involve the document disclosure system (I), steps 5-6 the document delivery system (II))
11
Query Evaluation
  • Direct comparison is time-consuming

12
Query Evaluation
  • A document model is used for the comparison
  • Lossy process, usually based on the presence of
    words in documents
  • Produces structured data suitable for efficient
    comparison

13
Query Evaluation
  • The query is processed to obtain the needed form
  • The processed query is compared against the index

14
Text preprocessing
  • Searching is more efficient using a created
    (structured) model of documents, but it can use
    only information stored in the model, not in the
    documents themselves.
  • The goal is to create a model preserving as much
    information from the original documents as
    possible.
  • Problem: a lot of ambiguity in text.
  • There still exist many unresolved tasks concerning
    document understanding.

15
Text understanding
  • Writer
  • Text = a sequence of words in natural language.
  • Each word stands for some idea/imagination of the
    writer.
  • Ideas represent real subjects, activities, etc.
  • The reader follows the text (not necessarily with
    exactly the same mappings) from left to right

...
16
Text understanding
  • Synonymy of words
  • Several words can have the same meaning for the
    writer
  • car = automobile
  • sick = ill

...
17
Text understanding
  • Homonymy of words
  • One word can have more than one meaning
  • fluke: fish, anchor, ...
  • crown: currency, treetop, jewel, ...
  • class: year of studies, category in set theory, ...

...
18
Text understanding
  • Word meanings need not be exactly the same.
  • Hierarchical overlapping
  • animal > horse > stallion
  • Associativity among meanings
  • calculator - computer - processor

...
19
Text understanding
  • The mapping between subjects, ideas and words can
    depend on the individual person, both readers and
    writers.
  • Two people can assign partly or completely
    different meanings to a given term.
  • Two people can imagine different things under the
    same word.
  • mother, room, ...
  • As a result, by reading the same text two different
    readers can obtain different information
  • Different from each other
  • Different from the author's intention

20
Text understanding
  • Homonymy and ambiguity grow with the transition
    from words/terms to sentences and larger parts of
    the text.
  • Example of an English sentence with more than one
    grammatically correct meaning (in this case a
    human reader probably eliminates the nonsense
    meaning)
  • See Podivné fungování gramatiky,
    http://www.scienceworld.cz/sw.nsf/lingvistika
  • In the sentence "Time flies like an arrow" either
    flies (fly) or like can be chosen as the
    predicate, which produces two significantly
    different meanings.

21
Text preprocessing
  • Inclusion of linguistic analysis into the text
    processing can partially solve the problem
  • Disambiguation
  • Selection of the correct meaning of a term in the
    sentence
  • According to grammar (Verb versus Noun etc.)
  • According to context (more complicated; can
    distinguish between two Verbs, two Nouns, etc.)

22
Text preprocessing
  • Inclusion of linguistic analysis into the text
    processing can partially solve the problem
  • Lemmatization
  • For each term/word in the text, after its proper
    meaning is found, assigns
  • Type of word, plural vs. singular, present tense
    vs. preterite, etc.
  • Base form (singular for Nouns, infinitive for
    Verbs, ...)
  • Information obtained by sentence analysis
    (subject, predicate, object, ...)

23
Text preprocessing
  • Other tasks, that can be more or less solved,
    are
  • Identification of collocations
  • World War Two, ...
  • Resolving pronouns to the nouns they refer to
    (very complex and hard to solve, sometimes even
    for a human reader)

24
Precision and Recall
  • As a result of ambiguities there exists no
    optimal text retrieval system
  • After the answer to the query is obtained, the
    following values can be evaluated
  • Number of returned documents in the list: Nv
  • The system supposed them to be relevant (useful)
    according to their match with the query
  • Number of returned relevant documents: Nvr
  • The questioner finds them to be really relevant as
    they fulfill his/her information needs
  • Number of all relevant documents in the system:
    Nr
  • Very hard to guess for large and unknown
    collections

25
Precision and Recall
  • Two TRSs can (and do) return two different
    results for the same query, which can be partly or
    completely unique.
  • How to compare the quality of those systems?

26
Precision and Recall
  • Two questioners can consider different documents to
    be relevant for their identically formulated query
  • How to meet both subjective expectations of the
    questioners?

27
Precision and Recall
  • The quality of the result set of documents is usually
    evaluated using the numbers Nv, Nr, Nvr
  • Precision
  • P = Nvr / Nv
  • Probability that a returned document is relevant
    to the user
  • Recall
  • R = Nvr / Nr
  • Probability that a relevant document is returned
    to the user
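  • A minimal sketch in Python of the two measures (function and
    variable names are illustrative, not from the slides):

    def precision_recall(n_returned, n_returned_relevant, n_relevant_total):
        """Compute P and R from the counts Nv, Nvr and Nr defined above."""
        precision = n_returned_relevant / n_returned         # P = Nvr / Nv
        recall = n_returned_relevant / n_relevant_total      # R = Nvr / Nr
        return precision, recall

    # e.g. 50 returned documents, 20 of them relevant, 80 relevant documents
    # in the whole collection: P = 0.4, R = 0.25
    print(precision_recall(50, 20, 80))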

28
Precision and Recall
  • Both coefficients depend on the feeling of the
    questioner
  • The same document can fulfill the information needs
    of the first questioner while at the same time failing
    to meet them for the second one.
  • Each user determines different values of the Nr and
    Nvr coefficients
  • Both measures P and R depend on them

29
Precision and Recall
  • In the optimal case
  • P = R = 1
  • There are all and only relevant documents in the
    response of the system
  • Usually
  • The answer in the first iteration is neither
    precise nor complete

(Figure: precision-recall plane; the optimal answer lies at P = R = 1, a typical initial answer lies near the origin)
30
Precision and Recall
  • Query tuning
  • Iterative modification of the query aimed at
    increasing the quality of the response
  • Theoretically it is possible to reach the optimum
    sooner or later

(Figure: query tuning trajectory in the P-R plane toward the optimum at P = R = 1)
31
Precision and Recall
  • Due to (not only) ambiguities both measures
    depend inversely on each other, i.e. P·R ≈
    const. < 1
  • In order to increase P, the absolute number of
    relevant documents in the response is decreased.
  • In order to increase R, the number of irrelevant
    documents grows rapidly.
  • The probability of reaching quality above the limit
    is low.

(Figure: the same P-R plane with the optimum at P = R = 1)
32
Prediction Criterion
  • At the time of query formulation the questioner has
    to guess the correct terms (words) the author used to
    express a given idea
  • Problems are caused e.g. by
  • Synonyms (the author could use a different synonym,
    not remembered by the user)
  • Overlapping meanings of terms
  • Colorful poetical hyperboles

33
Prediction Criterion
  • The problem can be partly suppressed by inclusion
    of a thesaurus, containing
  • Hierarchies of terms and their meanings
  • Sets of synonyms
  • Definitions of associations between terms
  • The questioner can use it during query formulation
  • The system can use it during query evaluation

34
Prediction Criterion
  • The user often tends to tune his/her own query in a
    conservative way
  • He/she tends to fix the terms used in the first
    iteration ("they must be the best because I
    remembered them immediately") and vary only
    additional terms at the end of the query
  • It is useful to support the user in
    (semi)automatically eliminating wrong terms and
    replacing them with useful ones, which describe
    really relevant documents

35
Maximal Criterion
  • The questioner is usually not able or willing to
    go through an exhaustive number of hits in the
    response to find the relevant ones
  • Usually at most 20-50 documents, according to their
    length
  • Need not only to sort out documents not matching
    the query, but to order the answer list by
    supposed relevancy in descending order - the
    supposedly best documents at the beginning

36
Maximal Criterion
  • Due to the maximal criterion, the user usually tries
    to increase the Precision of the answer
  • A small number of resulting documents in the
    answer, containing as high a ratio of relevant
    documents as possible
  • Some problematic domains require both high
    precision and recall
  • Lawyers, especially in territories having case
    law based on precedents (the need to find as many
    similar cases as possible)

37
Exact pattern matching
38
Why Search for Patterns in Text
  • To index documents or queries
  • To involve only a given set of terms (lemmas)
  • To omit a given set of meaningless terms (lemmas)
    such as conjunctions, numerals, pronouns, ...
  • To highlight given terms in documents presented
    to users

39
Algorithm Classification by Preprocessing
  • I - Brute-force algorithm
  • II - Others (suitable for TRS)
  • Further divided according to
  • Number of simultaneously matched patterns
  • 1, N, ∞
  • Direction of comparison
  • Left to right
  • Right to left

40
Class II Algorithms
41
Exact Pattern Matching
  • Searching for One Pattern
  • Within the Text

42
Brute-force Algorithm
  • Let m denote the length of text t, and n the
    length of pattern p.
  • If the i-th position in the text doesn't match the
    j-th position in the pattern
  • Shift the pattern one position to the right and
    restart the comparison at the first (leftmost)
    position of the pattern
  • Average time complexity O(m·n), e.g. when searching
    for a^(n-1)b in a^(m-1)b
  • For natural-language text/pattern: m·const
    operations, i.e. O(m); const is a small number
    (< 10), dependent on the language
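  • A sketch of the brute-force search in Python (0-based
    indexing; the function name is illustrative):

    def brute_force_search(t, p):
        """Return the first 0-based position of pattern p in text t, or -1."""
        m, n = len(t), len(p)
        for i in range(m - n + 1):          # shift the pattern one position at a time
            j = 0
            while j < n and t[i + j] == p[j]:
                j += 1                      # compare left to right
            if j == n:                      # the whole pattern matched
                return i
        return -1

    print(brute_force_search("abracadabra", "cad"))   # 4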

43
Knuth-Morris-Pratt Algorithm
  • Left-to-right searching for one pattern
  • In comparison with the brute-force algorithm, KMP
    eliminates repeated comparison of already
    successfully compared characters of the text
  • The pattern is shifted as little as possible, to align
    a proper prefix of the examined part of the pattern
    below the matching fragment of the text

44
KMP Algorithm
45
KMP Algorithm
  • In front of the mismatch position a proper prefix of
    the already examined part of the pattern remains
  • It has to be equal to a suffix of the already
    examined part of the pattern
  • The longest such prefix determines the smallest
    shift

46
KMP Algorithm
47
KMP Algorithm
  • If
  • the j-th position of pattern p doesn't match the i-th
    position of text t
  • the longest proper prefix of the already examined part
    of the pattern that equals a suffix of the already
    examined part of the pattern has length k
  • then
  • after the shift, k characters remain before the
    mismatch position
  • the comparison restarts from the (k+1)-st position of
    the pattern
  • Restart positions are pre-computed and stored in an
    auxiliary array A
  • In this case A[j] = k+1

48
KMP Algorithm
  begin {KMP}
    m := length(t); n := length(p); i := 1; j := 1;
    while (i <= m) and (j <= n) do
    begin
      while (j > 0) and (p[j] <> t[i]) do j := A[j];
      inc(i); inc(j);
    end; {while}
    if (j > n) then {pattern found at position i-j+1}
    else {pattern not found}
  end; {KMP}

49
Obtaining of array A for KMP search
  • A[1] = 0
  • If all values are known for positions 1 .. j-1,
    it is easy to compute the correct value for the j-th
    position
  • Let A[j-1] contain the correction for the (j-1)-st
    position, i.e., A[j-1]-1 characters at the beginning
    of the pattern are the same as the equivalent number
    of characters before the (j-1)-st position

50
Obtaining of array A for KMP search
51
Obtaining of array A for KMP search
  • If the (j-1)-st position of the pattern matches the
    A[j-1]-th position, the prefix can be prolonged, and
    so the correct value of A[j] is one higher than the
    previous value.

52
Obtaining of array A for KMP search
53
Obtaining of array A for KMP search
  • If the (j-1)-st and A[j-1]-th positions in the pattern
    don't match, the correction A[j-1]+1 would
    cause a mismatch at the previous position in the text
  • The correction for such a mismatch is already
    known (the numbers A[1] .. A[j-1] are already
    computed)

54
Obtaining of array A for KMP search
  • It is necessary to follow the corrections starting at
    the (j-1)-st position until the (j-1)-st position in
    the pattern matches the found target position, or the
    correction reaches 0 (falls out of the pattern)

55
Obtaining of array A for KMP search
56
Obtaining of array A for KMP search
57
Obtaining of array A for KMP search - algorithm
  begin
    A[1] := 0; n := length(p); j := 2;
    while (j <= n) do
    begin
      k := j-1; l := k;
      repeat l := A[l] until (l = 0) or (p[l] = p[k]);
      A[j] := l+1;
      inc(j);
    end;
  end;
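  • A compact Python sketch combining the preprocessing of array A
    with the KMP search, using 0-based indexing instead of the
    1-based pseudocode above (names are illustrative):

    def kmp_table(p):
        """a[j] = length of the longest proper prefix of p[:j] that is also its suffix."""
        a = [0] * (len(p) + 1)
        a[0] = -1                       # sentinel playing the role of A[1] = 0 above
        k = -1
        for j in range(1, len(p) + 1):
            while k >= 0 and p[k] != p[j - 1]:
                k = a[k]
            k += 1
            a[j] = k
        return a

    def kmp_search(t, p):
        """Return the first 0-based position of p in t, or -1."""
        a = kmp_table(p)
        j = 0
        for i, c in enumerate(t):
            while j >= 0 and p[j] != c:
                j = a[j]                # follow the precomputed corrections
            j += 1
            if j == len(p):
                return i - j + 1
        return -1

    print(kmp_search("abababc", "ababc"))   # 2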

58
KMP algorithm
  • The time complexity of KMP is O(m+n).
  • Already successfully compared positions in the text
    are never checked again
  • After each shift of the pattern the given mismatch
    position can be checked again, but there are at
    most O(m) shifts of the pattern.
  • Similarly, the time complexity of the preprocessing
    is O(n).

59
KMP Optimization
  • It is possible to further optimize the auxiliary
    array A
  • If the character p[j] equals p[A[j]], the same
    character as the one that caused the mismatch would
    be aligned to the mismatch position.
  • In this case the optimization can be computed in
    advance in another auxiliary array A', where
    A'[j] =def A'[A[j]]
  • Else A'[j] =def A[j]
  • The array A' can be used during the search phase

60
Boyer-Moore Algorithm
  • Right-to-left search of one pattern using pattern
    preprocessing
  • The pattern is shifted left to right
  • Characters of the pattern are compared from right to
    left

61
Boyer-Moore Algorithm
  • If a mismatch of the (n-j)-th position of the pattern
    against the (i-j)-th position of the text occurs,
    where T[i-j] = x, and
  • n denotes the length of the pattern,
  • i denotes the position of the end of the pattern in
    the text,
  • j = 0..n-1
  • x ∈ X, X is the alphabet
  • the pattern is moved by SHIFT[n-j,x] characters to
    the right
  • The comparison restarts at the end of the pattern,
    i.e. for j = 0

62
Boyer-Moore Algorithm
  • There exist several different definitions of
    SHIFT[n-j,x]
  • Variant 1: The auxiliary array SHIFT[0..n-1,X] is, for
    each position in the pattern and for each
    character of the alphabet X, defined as follows
  • The smallest possible shift aligning the
    character x in the text at the mismatch position
    with the same character in the pattern.
  • If there exists no such character x in the
    pattern left of the mismatch position, shift the
    pattern to start immediately after the mismatch
    position.
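  • A simplified Python sketch of the idea behind Variant 1, using
    only the rightmost occurrence of each character instead of the
    full per-position SHIFT table (a common simplification; names
    are illustrative):

    def bm_bad_char_search(t, p):
        """Right-to-left comparison with a bad-character shift; first match or -1."""
        m, n = len(t), len(p)
        last = {c: j for j, c in enumerate(p)}   # rightmost position of each character
        i = n - 1                                # position of the pattern's end in the text
        while i < m:
            j = 0
            while j < n and t[i - j] == p[n - 1 - j]:
                j += 1                           # compare characters right to left
            if j == n:
                return i - n + 1                 # match found
            x = t[i - j]                         # mismatching text character
            # align the rightmost occurrence of x, or skip past the mismatch position
            i += max(1, (n - 1 - j) - last.get(x, -1))
        return -1

    print(bm_bad_char_search("here is a simple example", "example"))   # 17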

63
Boyer-Moore Algorithm (1)
  • The worst-case time complexity is O(m·n), e.g. when
    searching for ba^(n-1) in a^(m-n)ba^(n-1)
  • For huge alphabets and patterns with a small number
    of different characters (especially for words
    searched in natural-language texts) the
    average time complexity is O(m/n)
  • i.e. the longer the pattern, the more efficient the
    search

64
Boyer-Moore Algorithm (1)
  • Example

65
Boyer-Moore Algorithm (1)
  • Representation of the SHIFT array for the pattern
    ANANAS
  • Full arrows depict a successful comparison of one
    character
  • Other arrows stand for the shift of the target
    character to the position of the starting character
  • Missing arrows mean a shift past the mismatch
    position

66
Boyer-Moore Algorithm (1)
  • Another representation. To save space,
    x ∈ {A, N, S, X}
  • X stands for any character not appearing in the
    pattern
  • Values marked as shifts represent the length of the
    shift
  • Other values represent the new value of j

67
Benchmark on Artificial Text
68
Benchmark on Artificial Text
69
Benchmark on Artificial Text
70
Benchmark on English Text
  • Note
  • Unique pattern ⇒ found at its original position

71
Benchmark on English Text
72
Benchmark on English Text
73
Benchmark on English Text
74
Benchmark on English Text
75
Review of Algorithms
76
Exact pattern matching
  • Searching for a Finite Set of Patterns

77
Aho-Corasick Algorithm
  • Left-to-right searching for several patterns
    simultaneously
  • Extension of the KMP algorithm
  • Preprocessing of the patterns
  • Linear reading of the text
  • Average time complexity O(m + Σ ni), where m denotes
    the length of the text and ni the length of the i-th
    pattern

78
A-C Algorithm
  • Text T
  • Set of patterns P = {P1, P2, ..., Pk}
  • Search engine S = (Q, X, q0, g, f, F)
  • Q finite set of states
  • X alphabet
  • q0 ∈ Q initial state
  • g: Q × X → Q (go) forward function
  • f: Q → Q (fail) backward function
  • F ⊆ Q set of final states

79
A-C Algorithm
  • States in the set Q correspond to all prefixes of
    all patterns
  • State q0 represents the empty prefix ε
  • g(q,x) = qx, iff qx ∈ Q
  • else g(q0,x) = q0
  • else g(q,x) is undefined
  • f(q) for q <> q0 is equal to the longest proper
    suffix of q present in the set Q
  • |f(q)| < |q|
  • Final states correspond to all complete patterns,
    i.e. F = P

80
A-C Algorithm
  • Search based on the total (fully defined) transition
    function δ(q,x): Q × X → Q
  • δ(q,x) = g(q,x), iff g(q,x) is defined
  • δ(q,x) = δ(f(q),x) otherwise
  • This is a correct definition, because |f(q)| - the
    distance of f(q) from the initial state - is less
    than |q|, and g(q0,x) is completely defined.

81
A-C Algorithm
  • f is constructed in order of increasing |q|, i.e.
    according to the distance of the state from the
    beginning
  • It is not necessary to define f(q0)
  • If |q| = 1, the longest proper suffix is empty, i.e.
    f(q) = q0
  • f(qx) = f(g(q,x)) = δ(f(q),x)
  • To determine the value of the fail function for state
    qx, accessible from state q using character x, it is
    necessary to start in q, follow the fail function to
    f(q) and then go forward using the character x

82
A-C Algorithm
  • Example: P = {he, her, she}, function g

83
A-C Algorithm
  • Example: P = {he, her, she}, function f

84
A-C Algorithm
  • Detection of all occurrences of patterns, even of
    patterns hidden inside other ones
  • Either collect all patterns detected in a given
    state by going through all states accessible from
    it using the fail function, i.e. final states among
    f^i(q), i > 0
  • Or - after the transition to state q, go through all
    states linked together by the fail function and
    report all final states

85
A-C Algorithm delta function
  function delta(q: state; x: alphabet): state;
  begin
    while g[q,x] = fail do q := f[q];
    delta := g[q,x];
  end; {delta}
  begin {A-C}
    q := q0;
    for i := 1 to length(t) do begin
      q := delta(q, t[i]);
      report(q) {report all found patterns ending at t[i]}
    end
  end; {A-C}
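  • A Python sketch of the same construction, building the goto and
    fail functions by breadth-first search (names are illustrative;
    the report step collects the patterns inherited along fail links):

    from collections import deque

    def build_ac(patterns):
        """Goto function g, fail function f and output sets for Aho-Corasick."""
        g = [dict()]                        # g[q] maps a character to the next state
        out = [set()]                       # patterns recognized when entering state q
        for pat in patterns:
            q = 0
            for x in pat:
                if x not in g[q]:
                    g.append(dict()); out.append(set())
                    g[q][x] = len(g) - 1
                q = g[q][x]
            out[q].add(pat)
        f = [0] * len(g)
        queue = deque(g[0].values())        # depth-1 states fail to q0
        while queue:
            q = queue.popleft()
            for x, r in g[q].items():
                queue.append(r)
                s = f[q]
                while s and x not in g[s]:  # follow fail links; g(q0, x) is total
                    s = f[s]
                f[r] = g[s][x] if x in g[s] and g[s][x] != r else 0
                out[r] |= out[f[r]]         # inherit patterns found via fail links
        return g, f, out

    def ac_search(text, patterns):
        g, f, out = build_ac(patterns)
        q, hits = 0, []
        for i, x in enumerate(text):
            while q and x not in g[q]:
                q = f[q]
            q = g[q].get(x, 0)
            for pat in out[q]:
                hits.append((i - len(pat) + 1, pat))
        return sorted(hits)

    print(ac_search("ushers", ["he", "her", "she"]))   # [(1, 'she'), (2, 'he'), (2, 'her')]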

86
KMP vs. A-C for 1 pattern
  • Equal algorithms, different formulations
  • j (= compared position)
  • P[1] = 0
  • P[j] = k
  • q = j-1 (= number of compared positions)
  • g(q0, x) = q0
  • f(q_(j-1)) = q_(k-1)

87
Commentz-Walter Algorithm
  • Right to left search for more patterns
    simultaneously
  • Combination of B-M and A-C algorithms
  • Average time complexity (for natural
    languages)o(m/min(ni)), wherem denotes length
    of textsni denotes length of i-th pattern

88
C-W Algorithm
  • Text T
  • Set of patterns P = {P1, P2, ..., Pk}
  • Search engine S = (Q, X, q0, g, f, F)
  • Q finite set of states
  • X alphabet
  • q0 ∈ Q initial state
  • g: Q × X → Q (go) forward function
  • f: Q → Q (fail) backward function
  • F ⊆ Q set of final states

89
C-W Algorithm
  • States in the set Q represent all suffixes of all
    patterns
  • State q0 represents the empty suffix ε
  • g(q,x) = xq, iff xq ∈ Q
  • f(q) for q <> q0 is equal to the longest proper
    prefix of q present in the set Q
  • |f(q)| < |q|
  • Final states correspond to all complete patterns,
    i.e. F = P

90
C-W Algorithm
  • Forward function

91
C-W Algorithm
  • Backward function (arrows going to q0 are not
    shown)

(Figure: forward and backward functions over the states e, he, she, r, er, her for the patterns he, her, she)
92
C-W Algorithm
  • LMIN = min(ni), the length of the shortest pattern
  • h(q) = |q|, the distance of state q from the initial
    state
  • char(x) = the minimal distance of a state reachable
    via character x
  • pred(q) = the predecessor of state q, i.e. the state
    representing the one-character-shorter suffix
  • If g(q,x) is not defined, the patterns (search
    engine) are shifted by shift(q,x) positions to the
    right and the search restarts from state q0 again
  • shift(q,x) = min( max( shift1(q,x), shift2(q) ),
    shift3(q) )

93
C-W Algorithm
  • shift1(q,x) = char(x) - h(q) - 1, if > 0
  • shift2(q) = min( LMIN, min{ h(q') - h(q) : f(q') = q } )
  • shift3(q0) = LMIN
  • shift3(q) = min( shift3(pred(q)), min{ h(q') - h(q) :
    f^k(q') = q ∧ q' ∈ F } )

94
C-W Algorithm
  • shift1(q,x): aligning of the collision character,
    e.g. char(y) - h(q) - 1 = 8 - 4 - 1 = 3
95
C-W Algorithm
  • shift2(q): aligning of the already checked part of
    the text; states whose fail function goes to q must
    be taken into account (here the shift is 1)
96
C-W Algorithm
  • shift3(q): aligning of (any) suffix of the checked
    text; the collision character need not be used again
    to find a match

97
Exact Pattern Matching
  • Searching for (Regular) Infinite Set of Patterns
    in Text

98
Regular expressions and languages
  • Regular expression R
  • Atomic expressions
  • ∅
  • ε
  • a, a ∈ X
  • Operations
  • U.V concatenation
  • U+V union
  • V^k = V.V...V (k times)
  • V* = V^0 + V^1 + V^2 + ...
  • V+ = V^1 + V^2 + V^3 + ...
  • Value of expression h(R)
  • h(∅) = ∅, the empty language
  • h(ε) = {ε}, the empty word only
  • h(a) = {a}, a ∈ X
  • h(U.V) = {u.v : u ∈ h(U) ∧ v ∈ h(V)}
  • h(U+V) = h(U) ∪ h(V)

99
Regular Expression Feature
  • 1)  U+(V+W) = (U+V)+W
  • 2)  U.(V.W) = (U.V).W
  • 3)  U+V = V+U
  • 4)  (U+V).W = (U.W)+(V.W)
  • 5)  U.(V+W) = (U.V)+(U.W)
  • 6)  U+U = U
  • 7)  ε.U = U
  • 8)  ∅.U = ∅
  • 9)  U+∅ = U
  • 10) U* = ε + U.U* = (ε + U)*

100
(Deterministic) Finite Automaton
  • K = ( Q, X, q0, δ, F )
  • Q is a finite set of states
  • X is an alphabet
  • q0 ∈ Q is the initial state
  • δ: Q × X → Q is a totally defined transition
    function
  • F ⊆ Q is a set of final states

101
(Deterministic) Finite Automaton
  • Configuration of the FA
  • (q,w) ∈ Q × X*
  • Transition of the FA
  • relation ⊢ ⊆ (Q × X*) × (Q × X*)
  • (q,aw) ⊢ (q',w) ⇔ δ(q,a) = q'
  • The automaton accepts word w iff (q0, w) ⊢* (q,ε),
    q ∈ F

102
Non-deterministic Finite Automaton
  • a) default def. K = ( Q, X, q0, δ, F )
    b) extended def. K = ( Q, X, S, δ, F )
  • Q is a finite set of internal states
  • X is an alphabet
  • q0 ∈ Q is the initial state; S ⊆ Q is
    (alternatively) a set of initial states
  • δ: Q × X → P(Q) is a transition function
  • F ⊆ Q is a set of final states

103
Non-deterministic Finite Automaton
  • NFA for P = {he, her, she}
  • S = {1, 4, 8}
  • F = {3, 7, 11}
  • S = {1}
  • F = {3, 4, 7}

104
NDFA?DFA Conversion
  • K = (Q, X, S, δ, F)
  • K' = (Q', X, q0', δ', F')
  • Q' = P(Q)
  • X' = X
  • q0' = S
  • δ'(q', x) = ∪ δ(q, x), q ∈ q'
  • F' = { q' ∈ Q' : q' ∩ F ≠ ∅ }
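  • A small Python sketch of this subset construction, with the NFA
    transition function given as a dictionary (the example NFA is
    illustrative, not from the slides):

    def nfa_to_dfa(states, alphabet, start_set, delta, finals):
        """Subset construction: delta maps (state, symbol) -> set of states."""
        start = frozenset(start_set)
        dfa_delta, todo, seen = {}, [start], {start}
        while todo:
            q = todo.pop()
            for x in alphabet:
                # union of the NFA transitions of all member states
                r = frozenset(s for p in q for s in delta.get((p, x), ()))
                dfa_delta[(q, x)] = r
                if r not in seen:
                    seen.add(r); todo.append(r)
        dfa_finals = {q for q in seen if q & frozenset(finals)}
        return seen, dfa_delta, start, dfa_finals

    # toy NFA over {a, b} accepting words containing "ab"
    delta = {(0, 'a'): {0, 1}, (0, 'b'): {0}, (1, 'b'): {2},
             (2, 'a'): {2}, (2, 'b'): {2}}
    Q, d, q0, F = nfa_to_dfa({0, 1, 2}, "ab", {0}, delta, {2})
    print(len(Q), len(F))   # 4 reachable subset states, 2 of them accepting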

105
NDFA→DFA Conversion: Set of Initial States Allowed
  • By table, only reachable states; transitions
    to state 1 are not shown

106
NDFA→DFA Conversion: Only One Initial State Allowed
  • By table, only reachable states; transitions
    to state 1 are not shown

107
Derivation of regular expression
  • If ... , then ...
  • I.e., if ... , then ...

108
Derivation of regular expression

109
Construction of DFAUsing Derivations of RE
  • The derivative of a regular expression allows one to
    directly and algorithmically build a DFA for any
    regular expression
  • Let V be a given regular expression over the alphabet X
  • Each state of the DFA defines a set of words that
    move the DFA from this state to any of the final
    states. So every state can be associated with a
    regular expression defining this set of words
  • q0 = V
  • δ(q,x) = the derivative of q by the character x
  • F = { q ∈ Q : ε ∈ h(q) }

110
Construction of DFAUsing Derivations of RE
  • V = (0+1)*.01 over the alphabet X = {0,1}
  • q0 = (0+1)*.01

111
Construction of DFAUsing Derivations of RE
  • V = (0+1)*.01 over the alphabet X = {0,1}
  • q0 = (0+1)*.01
  • F = { (0+1)*.01 + ε }

112
Document Models
  • Different variants of models
  • Take the (non)existence of terms in documents into
    account, or not
  • Take the frequencies of terms in documents into
    account, or not
  • Take the positions of terms in documents into
    account, or not

113
Document Models in TRSs
  • Boolean Model

114
Boolean Model of TRS
  • Mid-20th century
  • Adoption of procedures used in librarianship
    and their gradual implementation

115
Boolean Model of TRS
  • Database (collection) D containing n documents
  • D = {d1, d2, ..., dn}
  • Documents described using m terms
  • T = {t1, t2, ..., tm}
  • term tj - a descriptor, usually a word or collocation
  • Each document is represented as a subset of the
    available terms
  • Contained in the document
  • Best describing the content of the document
  • di ⊆ T

116
Boolean Model of TRS
  • Assigning a set of terms to a document can be
    achieved by different approaches
  • Subdivision according to who performs the indexing
  • Manual
  • Done by a human indexer who understands the
    content of the document
  • Inconsistent: several indexers need not produce
    the same set of terms, and one indexer might later
    produce a different set of terms than before.
  • Automatic
  • Done algorithmically
  • Consistent, but without text understanding
  • Subdivision according to the freedom in selecting
    descriptors
  • Controlled
  • The set of terms is defined in advance and the
    indexer cannot change it; he/she can only select
    those terms describing the given document as well
    as possible.
  • Non-controlled
  • The set of terms can be extended whenever a new
    document is inserted into the collection.

117
Indexation
  • Thesaurus
  • Internally structured set of terms
  • Synonyms with a defined preferred term
  • Hierarchies of semantically narrower/broader
    terms
  • Similar terms
  • ...
  • Stop-list
  • Set of non-significant terms that are meaningless
    for indexation
  • Pronouns, interjections, ...

118
Indexation
  • Common words are not suitable for document
    identification
  • Neither are too specific words: a lot of different
    terms appear in a very small number of documents
  • Their elimination significantly decreases the size
    of the index, and only slightly decreases its quality

119
Boolean Model of TRS
  • The query is represented by a Boolean expression
  • ta AND tb - the document has to contain/be
    described by both terms
  • ta OR tb - the document has to contain/be
    described by at least one of the terms
  • NOT t - the document must not contain/be
    described by the given term

120
Boolean Model of TRS
  • Query examples
  • searching AND information
  • encoding OR decoding
  • processing AND (document OR text)
  • computer AND NOT personal
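  • A minimal Python sketch of evaluating such queries with
    document-id sets, assuming an inverted index mapping each term
    to the set of documents it describes (the toy index is
    illustrative):

    # term -> set of document ids (toy inverted index)
    index = {
        "searching": {1, 2, 5},
        "information": {2, 3, 5},
        "computer": {1, 4},
        "personal": {4},
    }
    all_docs = {1, 2, 3, 4, 5}

    def docs(term):
        return index.get(term, set())

    print(docs("searching") & docs("information"))           # searching AND information -> {2, 5}
    print(docs("computer") & (all_docs - docs("personal")))  # computer AND NOT personal -> {1}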

121
Boolean Model of TRS Extensions
  • Collocations in queries
  • "searching for information"
  • "data encoding" OR "data decoding"
  • "text processing"
  • computer AND NOT "personal computer"

122
Boolean Model of TRS Extensions
  • Use of factual meta-data (attribute values)
  • database AND (author = Salton)
  • text processing AND (year_of_publishing > 1990)

123
Boolean Model of TRS Extensions
  • Wildcards in terms
  • datab* AND system*
  • stands for the terms database, databases, system,
    systems, etc.
  • portabl? AND computer*
  • stands for the terms portable, computer,
    computers, computerized etc.

124
Boolean Index Structure
  • Inverted file
  • It holds a list of identified documents for each
    term (instead of a set of terms for each
    document)
  • t1: d1,1, d1,2, ..., d1,k1
  • t2: d2,1, d2,2, ..., d2,k2
  • tm: dm,1, dm,2, ..., dm,km

125
Boolean Index Structure
  • One-by-one processing of inserted documents
    produces a sequence of pairs <doc_id, term_id>
    sorted by the first component, i.e. by doc_id
  • Next the sequence is reordered lexicographically
    by (term_id, doc_id) and duplicates are removed
  • The result can be further optimized by adding a
    directory pointing to the sections corresponding to
    individual terms, and removing term_ids from the
    sequence
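  • A Python sketch of the pipeline just described: collect the
    pairs in document order, re-sort by term, drop duplicates and
    group into per-term posting lists (names are illustrative):

    from itertools import groupby

    def build_inverted_file(docs):
        """docs: dict doc_id -> list of term ids occurring in the document."""
        pairs = [(doc_id, term_id) for doc_id, terms in docs.items() for term_id in terms]
        # reorder lexicographically by (term_id, doc_id) and drop duplicates
        pairs = sorted(set((t, d) for d, t in pairs))
        return {term: [d for _, d in group]
                for term, group in groupby(pairs, key=lambda p: p[0])}

    print(build_inverted_file({1: ["cat", "dog"], 2: ["dog", "dog", "fish"]}))
    # {'cat': [1], 'dog': [1, 2], 'fish': [2]}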

126
Lemmatization and Disambiguation of the Czech Language (ÚFAL)
  • Odpovědným zástupcem nemůže být každý.
    ("Not everyone can be the responsible representative.")
  • Zákon by měl zajistit individualizaci odpovědnosti
    a zajištění odbornosti.
    ("The law should ensure individualization of
    responsibility and assurance of expertise.")
  • <p n=1><s id="docID001-p1s1"><f cap>Odpovědným<MDl>odpovědný_(kdo_za_něco_odpovídá)
    <MDt>AAIS7----1A----<f>zástupcem<MDl>zástupce<MDt>NNMS7-----A----<f>nemůže
    <MDl>moci_(mít_možnost_něco_dělat)<MDt>VB-S---3P-NA---<f>být<MDl>být
    <MDt>Vf--------A----<f>každý<MDl>každý<MDt>AAIS1----1A----
  • <p n=2>

(Annotation labels in the example: paragraph nr., sentence nr., word in the document, lemma including its meaning, type of word)
127
Proximity Constraints
  • t1 (m,n) t2
  • most general form
  • term t2 can appear at most m words after t1, or
    term t1 can appear at most n words after t2.
  • t1 sentence t2
  • terms have to appear in the same sentence
  • t1 paragraph t2
  • terms have to appear in the same paragraph

128
Proximity Constraints Evaluation
  • Using the same index structure
  • Operators replaced by conjunctions
  • Query evaluation to find candidates
  • Check for co-occurrences in primary texts
  • Small index
  • Longer time needed for evaluation
  • Necessity of storing primary documents
  • Extension of index by positions of term
    occurrences in documents
  • Large index

129
Extended Index Structure
  • During indexation a sequence of 5-tuples
    <dok_id, term_id, para_nr, sent_nr, word_nr> is built,
    ordered by dok_id, para_nr, sent_nr, word_nr
  • The sequence is reordered by
    <term_id, dok_id, para_nr, sent_nr, word_nr>
  • No duplicates are removed

130
Thesaurus Utilization
  • BT(x) - Broader Term to term x
  • NT(x) - Narrower Terms
  • PT(x) - Preferred Term
  • SYN(x) - SYNonyms to term x
  • RT(x) - Related Terms
  • TT(x) - Top Term

131
Disadvantages of Boolean Model
  • Salton:
  • Query formulation is more an art than a science.
  • Hits cannot be rated by their quality.
  • All terms in the query are taken as equally
    important.
  • The output size cannot be controlled. The system
    frequently produces empty or very large answers.
  • Some results don't correspond to intuitive
    understanding.
  • Documents in the answer to a disjunctive query can
    contain only one of the mentioned terms as well as
    all of them.
  • Documents eliminated from the answer to a conjunctive
    query can be missing only one of the mentioned terms
    as well as all of them.

132
Partial Answer Ordering
  • Q = (t1 OR t2) AND (t2 OR t3) AND t4
  • conversion to an equivalent DNF
  • Q = (t1 AND t2 AND t3 AND t4)
  •   OR (t1 AND t2 AND NOT t3 AND t4)
  •   OR (t1 AND NOT t2 AND t3 AND t4)
  •   OR (NOT t1 AND t2 AND t3 AND t4)
  •   OR (NOT t1 AND t2 AND NOT t3 AND t4)

133
Partial Answer Ordering
  • Each elementary conjunction (EC) contains
    all terms used in the original query and is rated by
    the number of terms used in a positive way (without
    NOT)
  • All ECs differ from each other in at least
    one term (one contains tj, another contains NOT
    tj)
  • Every document corresponds to at most one EC
  • A document is then rated by the number assigned to
    the given EC.

134
Partial Answer Ordering
  • There exist 2^k ECs in case of a query using k
    terms
  • There exist only k different ratings
  • Several ECs can have the same rating
  • (ta OR tb) = (ta AND tb)      rating 2
       OR (ta AND NOT tb)         rating 1
       OR (NOT ta AND tb)         rating 1

135
Vector Space Model of TRS
  • 1970s
  • approximately 20 years younger than the Boolean
    model of TRS
  • Tries to minimize and/or eliminate the disadvantages
    of the Boolean model

136
Vector Space Model of TRS
  • Database D containing n documents
  • D = {d1, d2, ..., dn}
  • Documents are described by m terms
  • T = {t1, t2, ..., tm}
  • term tj - a word or collocation
  • Document representation using an m-dimensional
    vector of term weights

137
Vector Space Model of TRS
  • Document model
  • wi,j - level of importance of the j-th term for
    identifying/describing the i-th document
  • Query
  • qj - level of importance of the j-th term for the
    user

138
Vector Space Model Index
139
Vector Space Model of TRS
  • The similarity between the vectors representing a
    document and the query is in general defined by a
    similarity function

140
Similarity Functions
  • Each factor wi,j·qj is proportional both to the
    level of importance in the document and for the user
  • Orthogonal vectors have zero similarity
  • Base vectors of the vector space (individual
    terms) are orthogonal to each other and so have
    zero similarity
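  • The individual measures are defined on the following slides; as
    a sketch, the scalar product and the cosine measure in Python
    (names are illustrative):

    import math

    def dot(d, q):
        """Scalar product of a document vector and a query vector."""
        return sum(di * qi for di, qi in zip(d, q))

    def cosine(d, q):
        """Salton's cosine measure: scalar product of the normalized vectors."""
        norm = math.sqrt(dot(d, d)) * math.sqrt(dot(q, q))
        return dot(d, q) / norm if norm else 0.0

    print(cosine([1, 0, 1, 2], [0, 0, 1, 1]))   # ~0.866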

141
Vector Space Model of TRS
  • Not only the angle, but also the sizes of the vectors
    influence the similarity
  • Longer vectors, which tend to be assigned to
    longer texts, have an advantage over shorter ones
  • It is desirable to normalize all vectors to have
    unit size

142
Vector normalization
  • Vector length influence elimination

143
Vector normalization
  • At indexing time
  • No overhead at search time
  • Sometimes it is necessary to re-compute all
    vectors in case the vectors also reflect
    aspects dependent on the complete collection
  • At search time
  • Part of the similarity function definition
  • Slows down the response of the system

144
Output Size Control
  • Documents in the output list are ordered by
    descending similarity to the given query
  • Most similar documents at the beginning of the
    list
  • The list size can be easily restricted with
    respect to the maximal criterion
  • The maximal number of documents in the list can
    be restricted
  • Only documents reaching a threshold similarity can
    be shown in the result

145
Negation in Vector Space Model
  • It is possible to extend the query space
  • Then the contribution of the j-th
    dimension can be negative
  • Documents that contain the j-th term are suppressed
    in comparison with others

146
Scalar product
147
Cosine Measure (Salton)
148
Jaccard Measure
149
Dice Measure
150
Overlap Measure
151
Asymmetric Measure
152
Pseudo-Cosine Measure
153
Vector Space Model Indexation
  • Based on number of occurrences of given term in
    given document
  • The more given word occurs in given document, the
    more important for its identification
  • Term FrequencyTFi,j term_occurs / all_occurs

154
Vector Space Model Indexation
  • Without a stop-list the result contains almost only
    meaningless words at the beginning

155
Vector Space Model Indexation
  • Term frequencies are very small even for the most
    frequent terms
  • Normalized term frequency: defined by cases (one
    value if the term occurs in the document, another
    otherwise)

156
Vector Space Model Indexation
Differentiation of important terms from non-important ones
157
Vector Space Model Indexation
  • IDF (Inverted Document Frequency) reflects the
    importance of a given term in the index for the
    complete collection

Entropy of the probability that the term occurs in a
randomly chosen document
158
Vector Space Model Indexation
  • (Optional) document vector normalization to
    unit size
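  • A Python sketch combining the three steps above (TF, IDF,
    optional normalization) for a toy collection; the exact TF and
    IDF variants differ between formulations, so this uses one
    common choice:

    import math
    from collections import Counter

    def tf_idf_index(docs):
        """docs: list of token lists; returns per-document weight dicts term -> w."""
        n = len(docs)
        df = Counter(term for doc in docs for term in set(doc))   # document frequencies
        index = []
        for doc in docs:
            counts = Counter(doc)
            total = sum(counts.values())
            w = {t: (c / total) * math.log2(n / df[t]) for t, c in counts.items()}
            norm = math.sqrt(sum(v * v for v in w.values()))      # optional: normalize to unit size
            index.append({t: v / norm for t, v in w.items()} if norm else w)
        return index

    docs = [["cat", "dog", "cat"], ["dog", "fish"], ["cat", "fish", "fish"]]
    print(tf_idf_index(docs)[0])   # normalized weights for the first document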

159
Querying in Vector Space Model
  • The equal representation of documents and queries
    brings many advantages over the Boolean Model
  • A query can be defined
  • Directly, by its hand-made definition
  • By reference to a known indexed document
  • By reference to a non-indexed document - the indexer
    creates an ad-hoc vector from its primary text
  • By a text fragment (using copy-paste etc.)
  • By a combination of some of the above-mentioned ways

160
Feedback
  • Query building/tuning based on user feedback on
    previous answers
  • Adding terms identifying relevant documents
  • Elimination of terms unimportant for relevant
    document identification and important for
    irrelevant ones
  • Prediction criterion improvement

161
Feedback
  • The answer to the previous query is classified by the
    user, who can mark relevant and/or irrelevant
    documents

162
Positive Feedback
  • Relevant documents attract the query towards them

163
Negative Feedback
  • Irrelevant documents push query away from them
  • Less effective than positive feedback
  • Less used

164
Feedback
  • The query iteratively migrates towards the center
    of relevant documents

165
Feedback
  • General form
  • One of the used special forms

Centroid (centre of gravity)
166
Feedback
  • General form
  • Another (weighted) form in use

Weights (1-β) / β
Weighted centroid (centre of gravity)
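  • A Python sketch of such a weighted (Rocchio-style) feedback
    step; the mixing weights alpha, beta, gamma are illustrative
    defaults, not values from the slides:

    def feedback(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
        """Move the query toward the centroid of relevant documents and,
        with a smaller weight, away from the centroid of irrelevant ones."""
        def centroid(vectors):
            if not vectors:
                return [0.0] * len(query)
            return [sum(col) / len(vectors) for col in zip(*vectors)]
        rel_c, irr_c = centroid(relevant), centroid(irrelevant)
        return [alpha * q + beta * r - gamma * i for q, r, i in zip(query, rel_c, irr_c)]

    q = [0.2, 0.0, 0.8]
    print(feedback(q, relevant=[[0.9, 0.1, 0.3], [0.7, 0.3, 0.1]],
                   irrelevant=[[0.0, 0.9, 0.0]]))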
167
Term Equivalence in VS Model
  • Individual terms (dimensions of the space) are
    supposedly, but not really, mutually independent
  • Problem with the prediction criterion:
    inappropriately chosen synonyms

168
Term Equivalence in VS Model
  • Equivalency matrix

169
Term Similarity in VS Model
  • Generalised equivalence
  • Similarity matrix
  • All computations used in the VS model can be
    evaluated also on the transposed index. There the
    mutual similarity of terms can be evaluated
    (vector dimension n, not m)
  • Really similar terms often co-occur together
  • Common terms often co-occur together as well

170
Term Hierarchies in VS Model
  • Similarly to the Boolean Model

(Figure: example term hierarchy with nodes Publication, Print, Book, Papers, Magazine)
171
Term Hierarchies in VS Model
  • Similarly to the Boolean Model
  • Edges can have assigned weights
  • User weights then can be easily propagated

(Figure: the same hierarchy with edge weights 0.4, 0.6, 0.3, 0.7 and propagated node weights 0.8, 0.48, 0.32, 0.224, 0.096)
172
Citations and VS Model
  • Scientific publications cite their sources
  • Assumption
  • Cited documents are semantically similar
  • Citing documents are semantically similar

173
Citations and VS Model
  • Direct reference between documents A and B
  • Document A cites document B
  • Denoted A→B
  • Indirect reference between A and B
  • There exist C1, ..., Ck such that A→C1→...→Ck→B
  • Link between documents A and B
  • A→B or B→A

174
Citations and VS Model
  • A and B are bibliographically paired if and only
    if they cite the same source C: A→C ∧ B→C
  • A and B are co-cited if and only if they are
    both cited in some document C: C→A ∧ C→B

175
Citations and VS Model
  • Acyclic oriented citation graph
  • Adjacency matrix of the citation graph
  • C = (cij) ∈ {0,1}^(n×n); cij = 1 iff i→j, cij = 0
    otherwise

176
Citations and VS Model
  • BP - matrix of bibliographic pairing
  • bpij = number of documents cited in both
    documents i and j
  • It follows that bpii = number of documents cited in i

177
Citations and VS Model
  • CP - matrix of co-citation pairing
  • cpij = number of documents citing both i and j
  • It follows that cpii = number of documents citing i

178
Citations and VS Model
  • DL - matrix of document links
  • dlij = 1 ⇔ (cij = 1 ∨ cji = 1)
  • It is possible to modify the resulting similarities
    between documents and a given query using some of the
    matrices BP, CP, DL
  • Modification of the index matrix D
  • D' = BP·D, resp. D' = CP·D, resp. D' = DL·D
  • D' = BP·CP·DL·D
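  • A small Python sketch of BP and CP derived from the citation
    matrix C, using the relations BP = C·Cᵀ and CP = Cᵀ·C that
    follow from the definitions above:

    def mat_mult(a, b):
        """Plain matrix multiplication for small 0/1 citation matrices."""
        return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
                for i in range(len(a))]

    def transpose(a):
        return [list(row) for row in zip(*a)]

    # c[i][j] = 1 iff document i cites document j
    C = [[0, 1, 1],
         [0, 0, 1],
         [0, 0, 0]]

    BP = mat_mult(C, transpose(C))   # bibliographic pairing: common cited documents
    CP = mat_mult(transpose(C), C)   # co-citation pairing: common citing documents
    print(BP)   # [[2, 1, 0], [1, 1, 0], [0, 0, 0]]
    print(CP)   # [[0, 0, 0], [0, 1, 1], [0, 1, 2]]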

179
Using mutual document similarities in VS Model
  • DS - matrix of mutual document similarities
  • dsij = similarity of documents i and j
  • The same idea as in the case of BP, CP, DL
  • Modification of the index matrix D
  • D' = DS·D

180
Term Discrimination Values
  • The discrimination value defines the importance of
    a term in the vector space for distinguishing the
    individual documents stored in the collection
  • By removal of the term from the index, i.e. by
    reduction of the index dimensionality, it can happen
    that
  • The overall distance between documents decreases
    (the average similarity of document pairs
    increases)
  • The overall distance between documents increases
    (the average similarity of document pairs
    decreases)
  • In this case the presence of the dimension in the
    space is not needed (it is counter-productive)

181
(Figure: example angles between document vectors: 45.0°, 35.3°, 0.0°, 45.0°)
182
Term Discrimination Values
  • Computation based on the average document similarity
  • A more efficient variant uses the central document
    (centroid)

183
Term Discrimination Values
  • The same value is computed for the space reduced
    by k-th dimension

184
Term Discrimination Values
  • The discrimination value is defined as the difference
    of the two average values
  • It can be used instead of IDFk
  • > 0
  • Important term, discriminating documents
  • DVk defines the measure of importance
  • ≤ 0
  • Unimportant term
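  • A Python sketch of the centroid-based computation of DVk
    (assuming the sign convention above, where a positive difference
    marks a discriminating term; data and names are illustrative):

    def avg_similarity_to_centroid(docs, skip_term=None):
        """docs: list of weight vectors; optionally ignore one dimension (term index)."""
        dims = [j for j in range(len(docs[0])) if j != skip_term]
        centroid = [sum(d[j] for d in docs) / len(docs) for j in dims]
        def cos(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = sum(x * x for x in a) ** 0.5
            nb = sum(y * y for y in b) ** 0.5
            return dot / (na * nb) if na and nb else 0.0
        return sum(cos([d[j] for j in dims], centroid) for d in docs) / len(docs)

    docs = [[0.9, 0.1, 0.5], [0.1, 0.9, 0.5], [0.5, 0.5, 0.5]]
    dv_2 = avg_similarity_to_centroid(docs, skip_term=2) - avg_similarity_to_centroid(docs)
    # negative here: term 2 has the same weight in every document, so it does not discriminate
    print(round(dv_2, 3))   # -0.038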

185
Term Discrimination Values (value DV of terms
depending on the number of documents in which the term
is present)

(Figure: x-axis - number of documents in which the term is present; the collection contains 7777 documents)
186
Document clustering
  • Kohonen maps
  • C3M algorithm
  • K-means algorithm

187
Document Clustering
  • The response time of a VS-based TRS is directly
    proportional to the number of documents in the
    collection that must be compared with the query
  • Clustering allows skipping a major part of the index
    during the search and comparing only the closest
    documents

188
Document Clustering
  • Without clusters it is necessary to compare all
    documents, even if a minimal needed similarity
    is defined

189
Document Clustering
  • Each cluster represents an m-dimensional sphere,
    defined by its center and radius
  • If not, it is possible to approximate it this way
    during computations

190
Document Clustering
  • Having clusters, the query evaluation need not
    compare documents in clusters outside the area of
    user interest

191
Cluster types
  • Clusters having the same volume
  • Easy to create
  • Some clusters can be almost empty, while others
    can contain huge amount of documents

192
Cluster types
  • Clusters having (approximately) the same number
    of documents
  • Hard to create
  • More effective in case of non-uniformly
    distributed docs.

193
Cluster types
  • Non-disjunctive clusters
  • One document can belong to more than one cluster
  • Sometimes weighted belonging in fuzzy clusters.

194
Cluster types
  • Disjunctive clusters
  • Each document can belong to exactly one cluster

195
Cluster types
  • It is not possible to completely and disjointly
    cover the space using spheres
  • It is possible to use convex polyhedra, where
    each document belongs to the closest center

196
Cluster types
  • Then clusters can be approximated by a non-disjoint
    set of spheres, each defined by its center and its
    most distant member document

197
Query Evaluation With Clusters (I)
  • Let a query q and a minimal required similarity s
    be given
  • Note: similarity is computed by the scalar product,
    vectors are normalized
  • The index is split into k clusters (c1,r1), ..., (ck,rk)
  • Note: radii are angular
  • Query radius r = arccos(s), i.e. s = cos(r)

198
Query Evaluation With Clusters (I)
  • Whether a cluster's intersection with the query area
    is empty is determined from the value
    arccos(Sim(q,ci)) - r - ri
  • If this value ≤ 0, documents in the cluster are
    compared
  • If this value > 0, its documents cannot be in the
    result
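  • A Python sketch of this pruning test (angular distances in
    radians; Sim is the scalar product of normalized vectors; the
    toy clusters are illustrative):

    import math

    def sim(a, b):
        return sum(x * y for x, y in zip(a, b))   # vectors assumed normalized

    def clusters_to_search(query, clusters, min_similarity):
        """clusters: list of (center, angular_radius); return indices worth searching."""
        r_query = math.acos(min_similarity)
        selected = []
        for i, (center, radius) in enumerate(clusters):
            # the cluster may intersect the query area iff this value is <= 0
            if math.acos(sim(query, center)) - r_query - radius <= 0:
                selected.append(i)
        return selected

    q = [1.0, 0.0]
    clusters = [([0.98, 0.199], 0.3), ([0.0, 1.0], 0.2)]   # normalized centers, angular radii
    print(clusters_to_search(q, clusters, min_similarity=0.8))   # [0]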

199
Query Evaluation With Clusters (II)
  • Let a query q and a maximal number of required
    documents x be given.
  • Again, the index is split into k clusters
    (c1,r1), ..., (ck,rk)
  • No radius of the query is available

200
Query Evaluation With Clusters (II)
  • Clusters are sorted in ascending order by the
    increasing distance of their centers from the
    query, i.e. according to arccos(Sim(q,ci))
  • Better: sorted by the increasing distance of the
    cluster boundary from the query, i.e. according to
    arccos(Sim(q,ci)) - ri

201
Query Evaluation With Clusters (II)
  • Clusters are sorted in ascending order by
    arccos(Sim(q,ci)) - ri, i.e. by the increasing
    distance of the cluster boundary from the query q
  • x = 7
