Title: Natural Language Processing Applications
1Natural Language Processing Applications
- Lecture 7
- Fabienne Venant
- Université Nancy2 / Loria
2Information Retrieval
3What is Information Retrieval?
- Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers)
- Applications
- Many universities and public libraries use IR systems to provide access to books, journals and other documents.
- Web search
- Large volumes of unstable, unstructured data
- Speed is important
- Cross-language IR
- Finding documents written in another language
- Touches on Machine Translation
- ....
4Concerns
- The set of texts can be very large, hence efficiency is a concern
- Textual data is noisy, incomplete and untrustworthy, hence robustness is a concern
- Information may be hidden
- Need to derive information from raw data
- Need to derive information from vaguely expressed needs
5IR Basic concepts
- Information needs, queries and relevance
- Indexing helps speed up retrieval
- Retrieval models describe how to search and recover relevant documents
- Evaluation: IR systems are large and convincing evaluation is tricky
6Information needs
7Information needs
- INFORMATION NEED: the topic about which the user desires to know more
- QUERY: what the user conveys to the computer in an attempt to communicate the information need
- RELEVANCE: a document is relevant if it is one that the user perceives as containing information of value wrt their personal information need
- Ex
- topic: pipeline leaks
- relevant documents: it doesn't matter whether they use those words or express the concept with other words, such as "pipeline rupture".
8Capturing information needs
- Information needs can be hard to capture
- One possibility: use natural language
- Advantage: expressive enough to allow all needs to be described
- Drawbacks
- Semantic analysis of arbitrary NL is very hard
- Users may not want to type full-blown sentences into a search engine
9Queries
10Queries
- Information needs are typically expressed as a query
- Where shall I go on holiday? → holiday destinations
- Two main types of possible queries
- How much blood does the human heart pump in one minute?
- Boolean queries
- → heart AND blood AND minutes
- Web-type queries
- → human biology
11Remarks
- A query
- is usually quite short and incomplete
- may contain misspelled or poorly selected words
- may contain too many or too few words
- The information need
- may be difficult to describe precisely, especially when the user isn't familiar with the topic
- Precise understanding of the document content is difficult.
12Persistent vs one-off Queries
- Queries may or may not evolve over time
- Persistent queries
- predefined and routinely performed
- "Top ten performing shares today"
- Continuous queries: persistent queries that allow users to receive new results when they become available
- typical of Information Extraction and News Routing systems
- One-off (or ad-hoc) queries
- created to obtain information as the need arises
- typical of Web searching
13Relevance
- Relevance is subjective
- "python" is ambiguous, but not for the user
- Topicality vs. Utility: a document is relevant wrt a specific goal
- → A document is relevant if it addresses the stated information need, not because it just happens to contain all the words in the query.
- Relevance is a gradual concept (a document is not just relevant or not, it is more or less relevant to a query)
- IR systems usually rank retrieved documents by relevance
- But many algorithms use a binary decision of relevance.
14The big picture
15Terminology
- An IR system looks for data matching some criteria defined by the users in their queries.
- The language used to ask a question is called the query language.
- These queries use keywords (atomic items characterizing some data).
- The basic unit of data is a document (can be a file, an article, a paragraph, etc.).
- A document corresponds to free text (may be unstructured).
- All the documents are gathered into a collection (or corpus).
16Searching for a given word in a document
- One way to do that is to start at the beginning and to read through all the text
- Pattern matching (regular expressions): with the speed of modern computers, grepping through text can be a very effective process
- Enough for simple querying of modest collections (millions of words)
- But for many purposes, you do need more
- To process large document collections (billions or trillions of words) quickly.
- To allow more flexible matching operations. For example, it is impractical to perform the query Romans NEAR countrymen with grep, where NEAR might be defined as "within 5 words" or "within the same sentence".
- To allow ranked retrieval: in many cases you want the best answer to an information need among many documents that contain certain words
- → You need an index
17Index
18Motivation for Indexing
- Extremely large dataset
- Only a tiny fraction of the dataset is relevant to a given query
- Speed is essential (0.25 seconds for web searching)
- Indexing helps speed up retrieval
19Indexing documents
- How to relate the user's information need to some document's content?
- Idea: use an index to refer to documents
- Usually an index is a list of terms that appear in a document; it can be represented mathematically as
- index: doc_i → ∪_j keyword_j
- Here, the kind of index we use maps keywords to the list of documents they appear in
- index': keyword_j → ∪_i doc_i
- We call this an inverted index.
20Indexing documents
- The set of keywords is usually called the dictionary (or vocabulary)
- A document identifier appearing in the list associated with a keyword is called a posting
- The list of document identifiers associated with a given keyword is called a postings list
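- As a minimal illustration, here is a sketch in Python of this keyword → postings mapping (a toy whitespace tokenizer; the two documents are borrowed from the exercise a few slides below):

    from collections import defaultdict

    def build_inverted_index(docs):
        """Map each keyword to the sorted postings list of documents it occurs in."""
        index = defaultdict(set)
        for doc_id, text in docs.items():
            for token in text.lower().split():
                index[token].add(doc_id)
        # postings lists are conventionally kept sorted
        return {term: sorted(ids) for term, ids in index.items()}

    docs = {1: "breakthrough drug for schizophrenia", 2: "new schizophrenia drug"}
    index = build_inverted_index(docs)
    # index["drug"] == [1, 2], index["new"] == [2]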
21Inverted files
- The most common indexing technique
- Source file: collection organised by documents
- Inverted file: collection organised by terms
22Inverted Index
- Given a dictionary of terms (also called vocabulary or lexicon)
- For each term, record in a list which documents the term occurs in
- Each item in the list
- records that a term appeared in a document
- and, later, often, the positions in the document
- is conventionally called a posting
- The list is then called a postings list (or inverted list)
23Inverted Index
From Introduction to Information Retrieval, C.D. Manning, P. Raghavan and H. Schütze
24Exercise
- Draw the inverted index that would be built for the following document collection
- Doc 1: breakthrough drug for schizophrenia
- Doc 2: new schizophrenia drug
- Doc 3: new approach for treatment of schizophrenia
- Doc 4: new hopes for schizophrenia patients
- For this document collection, what are the returned results for these queries?
- schizophrenia AND drug
- schizophrenia AND NOT(drug OR approach)
25Indexing documents
- Arising questions: how to build an index automatically? What are the relevant keywords?
- Some additional desiderata
- fast processing of large collections of documents,
- having flexible matching operations (robust retrieval),
- having the possibility to rank the retrieved documents in terms of relevance
- To ensure these requirements (especially fast processing) are fulfilled, the indexes are computed in advance
- Note that the format of the index has a huge impact on the performance of the system
26Indexing documents
- NB: an index is built in 4 steps
- Gathering of the collection (each document is given a unique identifier)
- Segmentation of each document into a list of atomic tokens → tokenization
- Linguistic processing of the tokens in order to normalize them → lemmatizing
- Indexing the documents by computing the dictionary and lists of postings
27Manual indexing
- Advantages
- Human judgements are the most reliable
- Retrieval is better
- Drawbacks
- Time-consuming
- Not always consistent
- different people build different indexes for the same document.
28Automatic indexing
- Using NLU?
- Not fast enough in real-world settings (e.g., web search)
- Not robust enough (low coverage)
- Difficulty: what to include and what to exclude.
- Indexes should not contain headings for topics for which there is no information in the document
- Can a machine parse full sentences of ideas and recognize the core ideas, the important terms, and the relationships between related concepts throughout the entire text?
29Building the vocabulary
30Stop list
- A list whose members are discarded during indexing
- some extremely common words which would appear to be of little value in helping select documents matching a user need are excluded from the vocabulary entirely.
- These words are called STOP WORDS
- Collection strategy
- Sort the terms by collection frequency (the total number of times each term appears in the document collection),
- Take the most frequent terms
- often hand-filtered for their semantic content relative to the domain of the documents being indexed
- What counts as a stop word depends on the collection
- in a collection of legal articles, "law" can be considered a stop word
- Ex
- a an and are as at be by for from has he in is it its of on that the to was were will with
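- A quick Python sketch of the collection-frequency strategy above (the cutoff k is an arbitrary placeholder; a real stop list would then be hand-filtered):

    from collections import Counter

    def stop_list(docs, k=25):
        """Return the k terms with highest collection frequency."""
        counts = Counter(token for text in docs for token in text.lower().split())
        return [term for term, _ in counts.most_common(k)]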
31Why eliminate stop words?
- Efficiency
- Eliminating stop words reduces the size of the index considerably
- Eliminating stop words reduces retrieval time considerably
- Quality of results
- Most of the time not indexing stop words does little harm
- keyword searches with terms like "the" and "by" don't seem very useful
- BUT, this is not true for phrase searches.
- The phrase query "President of the United States" is more precise than President AND United States.
- The meaning of "flights to London" is likely to be lost if the word "to" is stopped out.
- .....
32Building the vocabulary
- Processing a stream of characters to extract keywords
- 1st task: tokenization; main difficulties
- token delimiters (ex: Chinese)
- apostrophes (ex: O'Neill, Finland's capital)
- hyphens (ex: Hewlett-Packard, state-of-the-art)
- segmented compound nouns (ex: Los Angeles)
- unsegmented compound nouns (icecream, breadknife)
- numerical data (dates, IP addresses)
- word order (ex: Arabic wrt nouns and numbers)
33Solutions for tokenization issues
- Using a pre-defined dictionary with largest matches and heuristics for unknown words
- Using learning algorithms trained over hand-segmented words
34Choosing keywords
- Selecting the words that are most likely to appear in a query
- These words characterize the documents they appear in
- Which are they?
35The bag of words approach
- Extreme interpretation of the principle of compositional semantics
- The meaning of documents resides solely in the words that are contained within them
- The exact ordering of the terms in a document is ignored but the number of occurrences of each term is material
36BoW
- "Not the same thing a bit!" said the Hatter. "You might just as well say that 'I see what I eat' is the same thing as 'I eat what I see'!"
- "You might just as well say," added the March Hare, "that 'I like what I get' is the same thing as 'I get what I like'!"
- "You might just as well say," added the Dormouse, who seemed to be talking in its sleep, "that 'I breathe when I sleep' is the same thing as 'I sleep when I breathe'!"
37Bags of words
- Nevertheless, it seems intuitive that two documents with similar bag-of-words representations are similar in content.
38What's in a bag of words?
- Are all words in a document equally important?
- stop words do not contribute in any way to retrieval and scoring
- BoW contain terms
- What should count as a term?
- Words
- Phrases (e.g., president of the US)
39Morphological normalization
- Should index terms be word forms, lemmas or stems?
- Matching morphological variants increases recall
- Example morphological variants
- anticipate, anticipating, anticipated, anticipation
- company/companies, sell/sold
- USA vs U.S.A.
- 22/10/2007 vs 10/22/2007 vs 2007/10/22
- university vs University
- Idea: use equivalence classes of terms,
- ex: Opel, OPEL, opel → opel
- Two techniques
- stemming refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time
- lemmatisation refers to doing things properly, with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the dictionary form of a word, which is known as the lemma.
- NB: documents and queries have to be processed using the same tokenization process!
40Stemming and Lemmatization
- Role: reducing inflectional forms to common base forms
- Example
- car, cars, car's, cars' → car
- am, are, is → be
- Stemming removes suffixes (surface markers) to produce root forms
- Lemmatization reduces a word to a canonical form (using a dictionary and a morphological analyser)
- Illustration of the difficulty
- plurals (woman/women, crisis/crises)
- derivational morphology (automatize/automate)
- English → Porter stemming algorithm (University of Cambridge, UK, 1980)
41Porter stemmer
- Algorithm based on a set of context-sensitive rewriting rules
- http://tartarus.org/martin/PorterStemmer/index.html
- http://tartarus.org/martin/PorterStemmer/def.txt
- Rules are composed of a pattern (left-hand side) and a string (right-hand side), example
- (.*)sses → \1ss : caresses → caress
- (.*[aeiou].*)ies → \1i : ponies → poni, ties → ti
- (.*[aeiou].*)ss → \1ss : caress → caress
- Rules may be constrained by conditions on the word's measure, example
- (m > 1) (.*)ement → \1 : replacement → replac, but not cement → c
- (m > 0) (.*)eed → \1ee : feed → feed, but agreed → agree
- (*v*) (.*)ed → \1 : plastered → plaster, but bled → bled
- (*v*) (.*)ing → \1 : motoring → motor, but sing → sing
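- A rough Python sketch of a few of the suffix rules above as regex rewrites (a toy fragment under these assumptions, not the full Porter algorithm; NLTK's PorterStemmer implements the real thing):

    import re

    # ordered (pattern, replacement) pairs mirroring the rules above
    RULES = [
        (re.compile(r"^(.*)sses$"), r"\1ss"),        # caresses -> caress
        (re.compile(r"^(.*)ies$"), r"\1i"),          # ponies -> poni, ties -> ti
        (re.compile(r"^(.*[aeiou].*)ing$"), r"\1"),  # motoring -> motor, sing unchanged
    ]

    def stem_step(word):
        """Apply the first matching rule to a lowercase word."""
        for pattern, repl in RULES:
            if pattern.match(word):
                return pattern.sub(repl, word)
        return word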
42Porter Stemmer: word measure
- Assume that a list of consonants is denoted by C, and a list of vowels by V
- Any word, or part of a word, has one of the four forms
- CVCV ... C
- CVCV ... V
- VCVC ... C
- VCVC ... V
- These may all be represented by the single form
- [C]VCVC ... [V]
- where the square brackets denote arbitrary presence of their contents.
- Using (VC)^m to denote VC repeated m times, this may again be written as
- [C](VC)^m[V]
- m will be called the measure of any word or word part when represented in this form.
- Here are some examples
- m=0: TR, EE, TREE, Y, BY
- m=1: TROUBLE, OATS, TREES, IVY
- m=2: TROUBLES, PRIVATE, OATEN, ORRERY.
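- A small sketch that computes m by collapsing a word into its C/V form (y is naively treated as a consonant here, a simplification the full algorithm refines):

    def measure(word):
        """Count m in the [C](VC)^m[V] decomposition."""
        vowels = set("aeiou")
        form = ""
        for ch in word.lower():
            kind = "V" if ch in vowels else "C"
            if not form or form[-1] != kind:
                form += kind      # collapse runs of consonants/vowels
        return form.count("VC")

    # measure("tree") == 0, measure("trouble") == 1, measure("troubles") == 2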
43Exercise
- What is the Porter measure of the following words (give your computation)?
- crepuscular
- rigorous
- placement
- cr ep usc ul ar
- C VC VC VC VC
- m = 4
- r ig or ous
- C VC VC VC
- m = 3
- pl ac em ent
- C VC VC VC
- m = 3
44Stemming
- Most stemmers also remove suffixes such as ed, ing, ational, ation, able, ism...
- Relational → relate
- Most stemmers don't use lexical lookup
- There are shortcomings
- Stemming can result in non-words
- Organization → organ
- Doing → doe
- Unrelated words can be reduced to the same stem
- police, policy → polic
45Stemming
- Popular stemmers
- Porter's
- Lovins
- Iterated Lovins
- Kstem
46Lemmatization
- Exceptions need to be handled
- sought → seek, sheep → sheep, feet → foot
- Computationally more expensive than stemming as it looks up words in a dictionary
- Lemmatizers for French
- http://bach.arts.kuleuven.be/pmertens/morlex/
- FLEMM (F. Namer)
- POS taggers with lemmatization: TreeTagger, LT-POS
47What is actually used?
- Most retrieval systems use stemming/lemmatising and stop word lists
- Stemming increases recall while harming precision
- Most web search engines do use stop word lists but not stemming/lemmatising because
- the text collection is extremely large, so the chance of matching morphological variants is higher
- recall is not an issue
- stemming is imperfect and the size and diversity of the web increase the chance of a mismatch
- stemming/tokenising tools are available for only a few languages
48Example Text Representations
- "Scientists have found compelling new evidence of possible ancient microscopic life on Mars, derived from magnetic crystals in a meteorite that fell to Earth from the red planet, NASA announced on Monday."
- Web search: scientists, found, compelling, new, evidence, possible, ancient, microscopic, life, mars, derived, magnetic, crystals, meteorite, fell, earth, red, planet, NASA, announced, Monday
- Information service or library search: scientist, find, compelling, new, evidence, possible, ancient, microscopic, life, mars, derive, magnetic, crystal, meteorite, fall, earth, red, planet, NASA, announce, Monday
49Granularity
- Document unit
- An index can map terms
- ... to documents
- ... to paragraphs in documents
- ... to sentences in document
- ... to positions in documents
- An IR system should be designed to offer choices of granularity.
- We will henceforth assume that a suitable size document unit has been chosen, together with an appropriate way of dividing or aggregating files, if needed.
50Index Content
- The index usually stores some or all of the following information
- For each term
- Document count. How many documents the term occurs in.
- Total frequency count. How many times the term occurs across all documents → popularity measure
- For each term and for each document
- Frequency. How often the term occurs in that document.
- Position. The offsets at which the term occurs in that document.
51Retrieval model
52What is a retrieval model?
- A model is an abstraction of a process, here retrieval
- Conclusions derived by the model are good if the model provides a good approximation of the retrieval process
- IR model variables: queries, documents, terms, relevance, users, information needs
- Existing types of retrieval models
- Boolean models
- Vector space models
- Probabilistic models
- Models based on belief nets
- Models based on language models
53Retrieval Models the general intuition
- Documents and user information needs are represented using index terms
- Index terms serve as links to documents
- Queries consist of index terms
- Relevance can be measured in terms of a match between query and document index terms
54Exact vs. Best Match
- Exact match
- A query specifies precise retrieval criteria
- Each document either matches or fails to match the query
- The result is a set of documents (no ranking)
- Best match
- A query describes good or best matching documents
- The result is a ranked list of documents
55Statistical retrieval
56Statistical Models
- A document is typically represented by a bag of words (unordered words with frequencies)
- User specifies a set of desired terms with optional weights
- Weighted query terms
- Q = < database 0.5; text 0.8; information 0.2 >
- Unweighted query terms
- Q = < database; text; information >
- No Boolean conditions specified in the query.
57Statistical Retrieval
- Retrieval based on similarity between query and documents.
- Output documents are ranked according to similarity to the query
- Similarity based on occurrence frequencies of keywords in query and document
- Automatic relevance feedback can be supported
- The user issues a (short, simple) query.
- The system returns an initial set of retrieval results.
- The user marks some returned documents as relevant or non-relevant.
- The system computes a better representation of the information need based on the user feedback.
- The system displays a revised set of retrieval results.
58Boolean model
59The boolean model
- Most common exact-match model
- Basic assumptions
- An index term is either present or absent in a document
- All index terms provide equal evidence wrt information needs
- Queries are Boolean combinations of index terms
- x AND y: documents that contain both x and y (intersection of addresses)
- x OR y: documents that contain x, y or both (union of addresses)
- NOT x: documents that do not contain x (complement set of addresses)
- Additionally,
- proximity operators
- simple regular expressions
- spelling variants
60Boolean queriesExample
- User information need
- → interested in learning about vitamins that are antioxidants
- User Boolean query
- → antioxidant AND vitamin
61The boolean model
- Example input collection (Shakespeare's plays)
- Doc 1
- I did enact Julius Caesar
- I was killed in the Capitol
- Brutus killed me.
- Doc 2
- So let it be with Caesar. The
- noble Brutus hath told you Caesar
- was ambitious
62The boolean model index construction
- First we build the list of pairs (keyword, docID)
63The boolean model index construction
- Then the lists are sorted by keyword, and frequency information is added
64The boolean model index construction
- Multiple occurrences of keywords are then merged to create a dictionary file and a postings file
65Processing Boolean queries
- User Boolean query: Brutus AND Calpurnia
- Over the inverted index
- Locate Brutus in the dictionary
- Retrieve its postings
- Locate Calpurnia in the dictionary
- Retrieve its postings
- Intersect the two postings lists
- The intersection operation is the crucial one. It has to be very efficient so as to be able to quickly find documents that contain both terms.
- Sometimes referred to as merging postings lists because it uses a merge algorithm
- Merge algorithm: a general family of algorithms that combine multiple sorted lists by interleaved advancing of pointers through each list
66Intersection
67Extended boolean queries
- Merging algorithm (from Manning et al., 07); NB: the postings lists HAVE to be sorted. A sketch follows below.
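- A sketch of this two-pointer intersection in Python (postings assumed to be sorted lists of document ids):

    def intersect(p1, p2):
        """Merge-style intersection of two sorted postings lists in O(m + n)."""
        answer, i, j = [], 0, 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i])
                i += 1
                j += 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return answer

    # intersect([1, 2, 4, 11, 31], [2, 31, 54]) == [2, 31]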
68Extended boolean queries
- Generalisation of the merging process
- Imagine more than 2 keywords appear in the query
- (Brutus AND Caesar) AND NOT (Capitol)
- Brutus AND Caesar AND Capitol
- (Brutus OR Caesar) AND (Capitol)
- ...
- Ideas
- consider keywords with shorter postings lists first (to reduce the number of operations).
- use the frequency information stored in the dictionary
- → See Manning et al., 07 for the algorithm
69Extended boolean queries
→ retrieved docs: D7, D5, D2
70Exercise
- How would you process the following queries (main steps)?
- Brutus AND NOT Caesar
- Try your algorithm on
71Exercise
- How would you process the following query (main steps)?
- Brutus OR NOT Caesar
72Remarks on the boolean model
- The Boolean model allows one to express precise queries (you know what you get, BUT you have no flexibility → exact matches)
- Boolean queries can be processed efficiently (the time complexity of the merge algorithm is linear in the sum of the lengths of the lists to be merged)
- Has been a reference model in IR for a long time
73Advantages of exact-match retrieval
- Predictable, easy to explain
- Structured queries
- Works well when information need is clear and
precise
74Drawbacks of exact-match retrieval
- Unintuitive for non-experts: adequate query formulation is difficult for most users
- no ranking of retrieved documents
- exact matching may lead to too few or too many retrieved documents
- too few if not using synonyms
- difficulty increases with collection size
- large result sets need to be compensated by interactive query refinement
- No notion of partial relevance (useful if the query is overrestrictive)
- All terms have equal importance (no term weighting)
- Ranking models are consistently better
75Boolean model: the story so far
- An inverted index associates keywords with postings lists
- The postings lists contain document identifiers (and other useful information, such as total frequencies, number of documents, etc.)
- Boolean queries are processed by merging postings lists in order to find the documents satisfying the query
- The cost of this list merging is linear in the total number of document ids: O(m + n)
- Question: how to process phrase queries (i.e. taking the words' context into account)?
76Dealing with phrase queries
- Many complex or technical concepts and many organization and product names are multiword compounds or phrases.
- Stanford University
- Graph Theory
- Natural Language Processing
- ...
- The user wants documents where the whole phrase appears, and not only some parts of it (i.e. "The inventor Stanford Ovshinsky never went to university" is not a match)
- About 10% of web queries are phrase queries (song names, institutions...)
- Such queries need either more complex dictionary terms, or a more complex index (critical parameter: size of the index)
77Biword indexes
- Use key-phrases of length 2, example
- Text: Natural Language Processing
- Dictionary
- Natural Language
- Language Processing
- The dictionary is made of biwords (notion of context)
- Query: Information retrieval in Natural Language Processing
- (Information retrieval) AND (retrieval Natural) AND (Natural Language) AND (Language Processing)
- It might seem a better query to omit the middle biword.
- Better results can be obtained by using more precise part-of-speech patterns that define which extended biwords should be indexed
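- A tiny sketch of biword extraction in Python (stop-word handling and the POS patterns mentioned above are omitted):

    def biwords(tokens):
        """Pair each token with its successor: the entries of a biword dictionary."""
        return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

    # biwords("natural language processing".split())
    # == ['natural language', 'language processing']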
78Positional indexes
- Store positions in the inverted index, example
- termID
- doc1: position1, position2, ...
- doc2: position1, position2, ...
- ....
- Processing then corresponds to an extension of the merging algorithm (additional checks while traversing the lists); see the sketch below
- NB: such indexes can be used to process proximity queries (i.e. using constraints on proximity between words)
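- A sketch of phrase matching over such an index; the layout {term: {doc: [positions]}} is an assumed representation for illustration:

    def phrase_docs(index, phrase):
        """Ids of documents containing the phrase's words at consecutive positions."""
        words = phrase.lower().split()
        # documents containing every word of the phrase
        common = set.intersection(*(set(index[w]) for w in words))
        matches = []
        for doc in common:
            starts = index[words[0]][doc]
            if any(all(p + k in index[w][doc] for k, w in enumerate(words))
                   for p in starts):
                matches.append(doc)
        return sorted(matches)

    index = {"to": {4: [8, 16]}, "be": {4: [9, 17]}}
    # phrase_docs(index, "to be") == [4]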
79- Positional indexes need an entry per occurrence (NB: classic inverted indexes need an entry per document id)
- The size of such indexes therefore grows with the total number of tokens, not just the number of documents
- The size of a positional index depends on the language being indexed and the type of document (books, articles, etc.)
- On average, a positional index is 2-4 times bigger than a classic inverted index; it can reach 35% to 50% of the size of the original text (for English)
- Positional indexes can be used in combination with classic indexes to save time and space (see Williams et al., 2005).
80Exercise
- Which documents can contain the sentence "to be or not to be", considering the following (incomplete) indexes?
- be
- 1: 7, 18, 33, 72, 86, 231
- 2: 3, 149
- 4: 17, 191, 291, 430, 434
- 5: 363, 367
- to
- 2: 1, 17, 74, 222, 551
- 4: 8, 16, 190, 429, 433
- 7: 13, 23, 191
81Exercise
- Given the following positional indexes, give the document ids corresponding to the query "world wide web"
- world
- 1: 7, 18, 33, 70, 85, 131
- 2: 3, 149
- 4: 17, 190, 291, 430, 434
- wide
- 1: 12, 19, 40, 72, 86, 231
- 2: 2, 17, 74, 150, 551
- 3: 8, 16, 191, 429, 435
- web
- 1: 20, 22, 41, 75, 87, 200
- 2: 18, 32, 45, 56, 77, 151
- 4: 25, 192, 300, 332, 440
82- The postings lists to access are: to, be, or, not.
- We will examine intersecting the postings lists for to and be.
- We first look for documents that contain both terms.
- Then, we look for places in the lists where there is an occurrence of be with a token index one higher than a position of to
- and then we look for another occurrence of each word with token index 4 higher than the first occurrence.
- In the above lists, the pattern of occurrences that is a possible match is
- to: <...; 4: <..., 429, 433>; ...>
- be: <...; 4: <..., 430, 434>; ...>
83Exercise
- Consider the following index
- Language: <d1, 12> <d2, 23-32-43> <d3, 53> <d5, 36-42-48>
- Loria: <d1, 25> <d2, 34-40> <d5, 38-51>
- where dI refers to document I, the other numbers being positions.
- The infix operator NEAR/x refers to a proximity of x between two terms
- Give the solutions to the query: language NEAR/2 Loria
- Give the pairs (x, docids) for each x such that language NEAR/x Loria has at least one solution
- Propose an algorithm for retrieving matching documents for this operator
84Example WESTLAW
- Large commercial system that has served the legal and professional market since 1974
- legal materials (court opinions, statutes, regulations, ...)
- news (newspapers, magazines, journals, ...)
- financial (stock quotes, financial analyses, ...)
- Total collection size: 5-7 terabytes
- 700,000 users (they claim 56% of legal searchers as of 2002)
- Best match added in 1992
85WESTLAW query language features
- Boolean and proximity operators
- Phrases: "West Publishing"
- Word proximity: West /5 Publishing
- Same sentence: Massachusetts /s technology
- Same paragraph: information /p retrieval
- Restrictions: DATE(AFTER 1992 & BEFORE 1995)
- Term expansion
- wildcard (THOM*SON), truncation (THOM!), automatic expansion of plurals, possessives
- Document structure (fields)
86WESTLAW query example
- Information need: information on the legal theories involved in preventing the disclosure of trade secrets by employees formerly employed by a competing company.
- → Query: "trade secret" /s disclos! /s prevent /s employe!
- Information need: requirements for disabled people to be able to access a workplace.
- → Query: disab! /p access! /s work-site work-place (employment /3 place)
- Information need: cases about a host's responsibility for drunk guests.
- → Query: host! /p (responsib! liab!) /p (intoxicat! drunk!) /p guest
87Boolean query languages are not dead
- Exact match is still prevalent in the commercial market (but then includes some type of ranking)
- Many users prefer Boolean
- For some queries/collections, Boolean may work better
- Boolean and free text queries find different documents
- → Need retrieval models that support both
88The Vector Space Model
89Best-Match retrieval
- Boolean retrieval is the archetypal example of exact-match retrieval
- Best-match or ranking models are now more common
- Advantages
- easier to use
- similar efficiency
- provides ranking
- best match generally has better retrieval performance
- most relevant documents appear at the top of the ranking
- But comparing best- and exact-match is difficult
90- Boolean model: all documents matching the query are retrieved
- The matching is binary: yes or no
- Extreme cases: the list of retrieved documents can be empty, or huge
- A ranking of the documents matching a query is needed
- A score is computed for each pair (query, document)
91Vector-space Retrieval
- By far the most common type of retrieval system
- Key idea: everything (documents, queries) is a vector in a high-dimensional space
- Vector coefficients for an object (document, query, term) represent the degree to which this object embodies each of the basic dimensions
- Relevance is measured using vector similarity: a document is relevant to a query if their representing vectors are similar
92Vector-space Representation
- Documents are vectors of terms
- Terms are vectors of documents
- A query is a vector of terms
93Graphic Representation
- Example
- D1 = 2T1 + 3T2 + 5T3
- D2 = 3T1 + 7T2 + T3
- Q = 0T1 + 0T2 + 2T3
94Similarity in the Vector-space
- Vectors can contain binary or weighted terms
- Binary term vector: 1 = term present, 0 = term absent
- Weighted term vector: indicates the relative importance of terms in a document
- Vector similarity can be measured in several ways
- Inner product (measure of overlap)
- Cosine coefficient
- Jaccard coefficient
- Dice coefficient
- Minkowski metric (dissimilarity)
- Euclidean distance (dissimilarity)
95Using the inner product similarity measure
- Given a query vector q and a document vector d, both of length n,
- the similarity between q and d is defined by the inner product of q and d:
- sim(q, d) = q · d = q1 d1 + q2 d2 + ... + qn dn
- where qi (di) is the value of the i-th position of q (d)
- With binary values this amounts to counting the matching terms between q and d
96Similarity: an example in the vector space
97The effect of varying document lengths
- Problem
- Longer documents will be represented with longer vectors, but that does not mean they are more important
- If two documents have the same score, the shorter one should be preferred
- Solution: the length of a document must be taken into account when computing the similarity score
98Document length normalization
- The length of a document: Euclidean length
- If d = (x1, x2, ..., xn) then ||d|| = sqrt(x1^2 + x2^2 + ... + xn^2)
- To normalize a document, we divide it by its own length: d / ||d||
- Similarity is given by the cosine measure between normalized vectors
- q · (d / ||d||)
- One problem is solved: shorter, more focused documents receive a higher score than longer documents with the same matching terms
- But shorter documents are generally preferred over longer ones!
- More sophisticated weighting schemes are generally used
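- A sketch of the cosine measure in Python (here both vectors are length-normalized, the usual symmetric form), with plain lists as vectors:

    import math

    def cosine(q, d):
        """Inner product of q and d divided by the product of their Euclidean lengths."""
        dot = sum(qi * di for qi, di in zip(q, d))
        norms = math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(x * x for x in d))
        return dot / norms if norms else 0.0

    # with the D1 and Q vectors of the earlier example:
    # cosine([2, 3, 5], [0, 0, 2]) ≈ 0.81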
99Term weights
- qi is the weight of term i in q
- Up to now, we only considered binary term weights
- 0: term absent
- 1: term present
- Two shortcomings
- Does not reflect how often a term occurs
- All terms are equally important ("president" vs. "the")
- Remedy: use non-binary term weights
- tf-score: store the frequency of a term in the vector (e.g., 4 if the term occurs 4 times in the document)
- idf-score: to distinguish meaningful terms, i.e. terms that occur in only a few documents
100Term frequency
- A document is treated as a set of words
- Each word characterizes that document to some extent
- When we have eliminated stop words, the most frequent words tend to be what the document is about
- Therefore f(k,d) (the number of occurrences of word k in document d) will be an important measure.
- → Also called the term frequency (tf)
101Document frequency
- What makes this document distinct from others in the corpus?
- The terms which discriminate best are not those which occur with high document frequency!
- Therefore d(k) (the number of documents in which word k occurs) will also be an important measure.
- → Also called the document frequency (df)
102TF.IDF
- This can all be summarized as
- Words are best discriminators when
- they occur often in this document (term frequency)
- they do not occur in a lot of documents (document frequency)
- One very common measure of the importance of a word to a document is
- TF.IDF = term frequency × inverse document frequency
- There are multiple formulas for actually computing this. The underlying concept is the same in all of them.
103Term weights
- tf-score: tf(i,j) = frequency of term i in document j
- idf-score: idf(i) = inverse document frequency of term i
- idf(i) = log(N / n(i)), with
- N, the size of the document collection (number of documents)
- n(i), the number of documents in which term i occurs
- idf(i) is high for terms occurring in a small proportion of the document collection → it measures the rarity of a term in the collection
- Term weight of term i in document j (TF-IDF):
- tf(i,j) · idf(i)
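- A compact sketch computing these weights for a toy collection (whitespace tokenization and the natural log are arbitrary illustrative choices):

    import math
    from collections import Counter

    def tfidf(docs):
        """Return {doc_id: {term: tf * idf}} for a list of token lists."""
        N = len(docs)
        df = Counter(term for tokens in docs for term in set(tokens))
        weights = {}
        for doc_id, tokens in enumerate(docs):
            tf = Counter(tokens)
            weights[doc_id] = {t: tf[t] * math.log(N / df[t]) for t in tf}
        return weights

    docs = [d.split() for d in ["new schizophrenia drug",
                                "new approach for treatment of schizophrenia"]]
    # tfidf(docs)[0]["drug"] == log(2/1) ≈ 0.69; "new" occurs everywhere, weight 0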
104Boolean retrieval vs. Vector Space Retrieval
- Boolean retrieval
- Documents are not ranked
- Boolean queries are not easy to manipulate
- Vector space retrieval
- Documents can be ranked
- Issue 1: choice of comparison function. Usually cosine comparison.
- Issue 2: choice of weighting scheme. Usually variations on tf(i,j) · idf(i)
105Evaluation
106Evaluation
- Issues
- User-based evaluation
- System-based evaluation
- TREC
- Precision and recall
107Evaluation methods
- Two types of evaluation methods
- User-based: measures user satisfaction
- System-based: focuses on how well the system ranks the documents
108User based evaluation
- More direct
- Expensive
- Difficult to do correctly
- Need a sufficiently large, representative sample of users
- The compared systems must be equally well developed (complete with fully functional user interface)
- Each user must be trained to control learning effects
- Information, information needs and relevance are intangible concepts
109System based evaluation
- Good system performance = good document rankings
- Allows for fair comparative testing
- Less expensive; can be reused
- Test collection: topics, documents, relevance judgments
- System-based evaluation goes back to the Cranfield experiments (1960s)
- Rate the relevance of retrieved bibliographic references on a scale from 1 to 4
110Recall and Precision
- Three important performance metrics (the third, the F-measure, follows on the next slide)
- Precision: proportion of retrieved documents that are relevant
- → No penalty for selecting too few items
- Recall: proportion of relevant documents that have been retrieved
- → No penalty for selecting too many items (e.g., retrieving everything)
111F-Measure
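- The F-measure combines precision P and recall R into a single score; the standard definition is their weighted harmonic mean:
- F = (β^2 + 1) · P · R / (β^2 · P + R)
- with β = 1 (precision and recall weighted equally): F1 = 2 · P · R / (P + R)
- Ex: P = 0.5 and R = 0.4 give F1 = 2 · 0.5 · 0.4 / 0.9 ≈ 0.44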
112Standard Text Collections
- Relevant documents must be identified
- Given a document collection D and a set of queries Q, REL(q) is the set of documents relevant to q
- Whether a document d is relevant to a query q is decided by human judgement
113Standard Text Collections
- CACM (computer science): 3204 abstracts, 64 queries
- CF (medicine): 1239 abstracts, 100 queries
- CISI (library science): 1460 abstracts, 112 queries
- CRANFIELD (aeronautics): 1400 abstracts, 225 queries
- LISA (library science): 6004 abstracts, 35 queries
- TIME (newspaper): 423 abstracts, 83 queries
- Ohsumed (medicine): 348,566 abstracts, 106 queries
114Building Test Collections
- How to identify relevant documents?
- How to assess relevance? (binary or finer-grained)
- One vs. several judges
115TREC
- Text REtrieval Conference
- Proceedings at http://trec.nist.gov/
- Established in 1991 to evaluate large-scale IR
- Retrieving documents from a gigabyte collection
- Organised by NIST and run continuously since 1991
- Best-known IR evaluation setting
- 25 participants in '92
- 109 participants from 4 continents in 2004
- European (CLEF) and Asian (NTCIR) counterparts
116TREC Format
- Several IR research tracks
- ad-hoc retrieval
- routing/filtering
- cross-language
- scanned documents
- spoken documents
- video
- web
- question answering
- ...
117TREC notion of relevance
- "If you were writing a report on the subject of the topic and would use the information contained in the document in the report, then the document is relevant"
- Pooling is used for identifying relevant documents
- A set of possibly relevant documents is created automatically for each information need
- The top 100 documents returned by each system are kept and inspected by judges who determine which documents are relevant
- Inter-judge agreement is about 80%
118Improving Recall and Precision
- The two big problems with short queries are
- Synonymy: poor recall results from missing documents that contain synonyms of the search terms, but not the terms themselves
- Polysemy/Homonymy: poor precision results from search terms that have multiple meanings, leading to the retrieval of non-relevant documents.
119Query Expansion
- Find a way to expand a user's query to automatically include relevant terms (that they should have included themselves), in an effort to improve recall
- Use a dictionary/thesaurus
- Use relevance feedback
120Thesauri
- A thesaurus contains information about words (e.g., violin) such as
- Synonyms: similar words, e.g., fiddle
- Hyperonyms: more general words, e.g., instrument
- Hyponyms: more specific words, e.g., Stradivari
- Meronyms: parts, e.g., strings
- A very popular machine-readable thesaurus is WordNet
121Problems of Thesauri
- Language dependent
- Available only for a couple of languages
122Cooccurrence models
- Semantically or syntactically related terms
- Cooccurrence vs. thesauri
- Easy to adapt to other languages/domains
- Also covers relations not expressed in thesauri
- Not as reliable as manually edited thesauri
- Can introduce considerable noise
- Selection criteria: mutual information, expected mutual information
123Relevance feedback
- Ask the user to identify a few documents which appear to be related to their information need
- Extract terms from those documents and add them to the original query.
- Run the new query and present those results to the user.
- Typically converges quickly
124Blind feedback
- Assume that the first few documents returned are the most relevant, rather than having users identify them
- Proceed as for relevance feedback
- Tends to improve recall at the expense of precision
125Post-Hoc Analysis
- When a set of documents has been returned, they can be analyzed to improve usefulness in addressing the information need
- Grouped by meaning for polysemic queries (using N-gram-type approaches)
- Grouped by extracted information (named entities, for instance)
- Grouped into an existing hierarchy if structured fields are available
- Filtering (e.g., eliminate spam)
126References
- Introduction to Information Retrieval, by C. Manning, P. Raghavan, and H. Schütze. To appear at Cambridge University Press (chapters available at the book website).
- Information Retrieval, Second Edition, by C.J. van Rijsbergen, Butterworths, London, 1979.