Information Retrieval - PowerPoint PPT Presentation

1
Information Retrieval
  • Shyh-Kang Jeng
  • Department of Electrical Engineering/
  • Graduate Institute of Communication Engineering
  • National Taiwan University

2
Reference
  • R. Baeza-Yates and B. Ribeiro-Neto, Modern
    Information Retrieval, Addison-Wesley, 1999.

3
Outline
  • Basic concepts
  • Information Retrieval Models
  • Text Property
  • Document Preprocessing
  • Indexing and Searching
  • Searching the Web

4
Information Retrieval Agents
5
Information Retrieval
  • Deals with information
  • Representation
  • Storage
  • Organization
  • Access
  • Provides the user with easy access to the
    information in which he is interested

6
Example of Information Retrieval
  • Find all pages (documents) containing information
    on college tennis teams which
  • are maintained by a university in the USA
  • participate in the NCAA tennis tournament
  • To be relevant, the page must include
  • National ranking of the team in the last three
    years
  • Email of the team coach

7
Information vs. Data Retrieval
  • Information retrieval
  • Results might be inaccurate
  • Small errors are likely to go unnoticed
  • Deals with natural language text which is not
    well structured and could be semantically
    ambiguous
  • Data retrieval
  • Aims at retrieving all objects which satisfy
    clearly defined conditions
  • A single erroneous object among a thousand
    retrieval objects means total failure
  • Has a well defined structure and semantics

8
User Task
  • Retrieval
  • Searches for desired information directly
  • Browsing
  • Still a process of retrieving information
  • Main objectives are not clearly defined in the
    beginning
  • The purpose might change during the interaction
    with the system

9
Interaction with the System
[Figure: the user interacts with the document database through retrieval and browsing]
10
Keywords
  • Queries are often translated into a set of
    keywords (or index terms) which summarizes the
    user's information need
  • Documents are also frequently represented through
    a set of index terms or keywords

11
Logical View: From Full Text to Set of Index Terms
12
Retrieval Process
13
Ad hoc and Filtering Retrieval
  • Ad hoc retrieval
  • The documents in the collection remain relatively
    static while new queries are submitted to the
    system
  • Filtering retrieval
  • The queries remain relatively static while new
    documents come into the system and leave

14
User Profile in Filtering Retrieval
  • User profile
  • Describes the user's preferences
  • Filtering
  • Profile is compared to the incoming documents to
    determine those that might be of interest to the
    user
  • Ranking
  • Rank the filtered documents and show the ranking
    to the user

15
Constructing User Profile
  • User provides a set of keywords which describes
    an initial profile of preference
  • As new documents arrive, the system uses this
    profile to select documents and show them to the
    user
  • The user indicates not only relevant documents,
    but also non-relevant documents
  • The system uses this information to adjust the
    user profile
  • Profile stabilizes after a while and no longer
    changes drastically unless the user's interests
    shift suddenly

16
Information Retrieval Model
  • Quadruple [D, Q, F, R(qi, dj)]
  • D: set composed of logical views for the
    documents in the collection
  • Q: set composed of logical views for the user
    information needs
  • F: framework for modeling document
    representations, queries, and their relationships
  • R(qi, dj): ranking function defining an ordering
    among the documents with regard to the query qi

17
Boolean Model
18
Vector Model
  • Generic index term $k_i$
  • Set of all index terms $K = \{k_1, \ldots, k_t\}$
  • Weight $w_{i,j} > 0$ is associated with index term $k_i$
    of a document $d_j$
  • Document $d_j$ is associated with a vector
  • $\vec{d}_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$
  • Weight $w_{i,q} > 0$ is associated with the pair $(k_i, q)$
  • Query vector $\vec{q} = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$

19
Similarity by Vector Model
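  • Evaluated as the correlation between the document
    vector $\vec{d}_j$ and the query vector $\vec{q}$
  • The correlation is quantified by the cosine of
    the angle between the two vectors:
    $sim(d_j, q) = \frac{\vec{d}_j \cdot \vec{q}}{|\vec{d}_j| \, |\vec{q}|}$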
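
A minimal sketch of the cosine measure, assuming each document and query is given as a plain list of term weights in the same term order. The function name and the choice to return 0 for an all-zero vector are assumptions made for illustration, not part of the slides.

```python
import math

def cosine_similarity(doc_weights, query_weights):
    """Cosine of the angle between a document vector and a query vector."""
    if len(doc_weights) != len(query_weights):
        raise ValueError("vectors must cover the same index terms")
    dot = sum(d * q for d, q in zip(doc_weights, query_weights))
    norm_d = math.sqrt(sum(d * d for d in doc_weights))
    norm_q = math.sqrt(sum(q * q for q in query_weights))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm_d * norm_q)

print(cosine_similarity([1.0, 1.0, 0.0], [1.0, 0.0, 0.0]))
```

Because the measure depends only on the angle, a long document is not favored merely for containing more terms.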

20
An Effective Term Weighting Scheme
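  • $N$: total number of documents
  • $n_i$: number of documents where $k_i$ appears
  • $freq_{i,j}$: raw frequency of term $k_i$ in $d_j$
  • Normalized frequency: $f_{i,j} = freq_{i,j} / \max_l freq_{l,j}$
  • Inverse document frequency: $idf_i = \log (N / n_i)$

21
tf-idf Scheme and Salton-Buckley Query Weighting
  • tf-idf scheme: $w_{i,j} = f_{i,j} \times \log (N / n_i)$
  • Salton-Buckley query weighting:
    $w_{i,q} = \left( 0.5 + \frac{0.5 \, freq_{i,q}}{\max_l freq_{l,q}} \right) \times \log (N / n_i)$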
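
A minimal sketch of tf-idf document weighting, assuming the usual definitions: the raw frequency of a term in a document is divided by the largest raw frequency in that document, and the result is multiplied by log(N / n_i), where N is the number of documents and n_i the number of documents containing the term. The function name, the token-list input, and the use of the natural logarithm are assumptions.

```python
import math
from collections import Counter

def tf_idf_weights(documents):
    """Return one {term: weight} dict per document.

    documents: list of token lists, assumed to be already preprocessed.
    """
    n_docs = len(documents)
    doc_freq = Counter()          # n_i: documents in which each term appears
    for tokens in documents:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in documents:
        raw = Counter(tokens)                     # raw frequency in this document
        max_freq = max(raw.values(), default=1)   # largest raw frequency
        weights.append({
            term: (count / max_freq) * math.log(n_docs / doc_freq[term])
            for term, count in raw.items()
        })
    return weights

docs = [["text", "text", "words"], ["words", "letters"]]
for w in tf_idf_weights(docs):
    print(w)
```

A term that occurs in every document gets weight 0, which reflects the idea that it cannot discriminate between documents.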

22
Recall and Precision
  • Recall
  • Fraction of the relevant documents which has been
    retrieved: Recall = |Ra| / |R|
  • Precision
  • Fraction of the retrieved documents which is
    relevant: Precision = |Ra| / |A|

[Figure: the collection, containing the set of relevant documents R, the answer set A, and their intersection Ra, the relevant documents in the answer set]
23
Example
  • Set containing relevant documents for query q
  • Rq = {d3, d5, d9, d25, d39, d44, d56, d71, d89,
    d123}
  • Ranking of the retrieved documents
  • 1. d123 6. d9 11. d38
  • 2. d84 7. d511 12. d48
  • 3. d56 8. d129 13. d250
  • 4. d6 9. d187 14. d113
  • 5. d8 10. d25 15. d3
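
The example can be worked through with a short sketch that walks down the ranking and records recall and precision each time a relevant document is found. The function name is hypothetical; the document identifiers are the ones listed on the slide.

```python
def precision_at_recall_points(ranking, relevant):
    """Return (recall, precision) pairs, one per relevant document found."""
    points = []
    found = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            found += 1
            points.append((found / len(relevant), found / rank))
    return points

relevant = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}
ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
           "d187", "d25", "d38", "d48", "d250", "d113", "d3"]

for recall, precision in precision_at_recall_points(ranking, relevant):
    print(f"recall {recall:.0%}  precision {precision:.1%}")
```

Relevant documents appear at ranks 1, 3, 6, 10, and 15, so precision falls from 100% at 10% recall to 33.3% at 50% recall; the remaining five relevant documents are never retrieved.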

24
Precision and Recall Figure
25
User Relevance Feedback
  • The user is presented with a list of the
    retrieved documents
  • After examining them, the user marks those which
    are relevant
  • In practice, only the top 10 (or 20) ranked
    documents need to be examined
  • Select important terms attached to the documents
    marked relevant
  • Enhance the importance of these terms in a new
    query formulation
  • The new query will be moved towards the relevant
    documents and away from the non-relevant ones

26
Term Reweighting
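  • Standard Rocchio:
    $\vec{q}_m = \alpha \vec{q} + \frac{\beta}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \frac{\gamma}{|D_n|} \sum_{\vec{d}_j \in D_n} \vec{d}_j$
  • $D_r$: set of relevant documents, as identified by
    the user, among the retrieved documents
  • $D_n$: set of non-relevant documents, as
    identified by the user, among the retrieved
    documents
  • $\alpha$, $\beta$, $\gamma$: tuning constants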
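
A minimal sketch of Rocchio-style reweighting, assuming the standard form in which the new query is alpha times the old query, plus beta times the centroid of the relevant documents, minus gamma times the centroid of the non-relevant ones. The function names and the default constants are assumptions chosen for illustration.

```python
def centroid(vectors, size):
    """Component-wise mean of the vectors, or the zero vector if there are none."""
    if not vectors:
        return [0.0] * size
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(size)]

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward relevant documents and away from non-relevant ones."""
    size = len(query)
    rel = centroid(relevant, size)
    non = centroid(nonrelevant, size)
    return [alpha * query[i] + beta * rel[i] - gamma * non[i]
            for i in range(size)]

print(rocchio([1.0, 0.0], [[0.0, 2.0]], [[2.0, 0.0]],
              alpha=1.0, beta=0.5, gamma=0.25))
```

With no feedback at all (both sets empty) the query is returned unchanged when alpha is 1.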

27
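Modeling of Natural Language: Zipf's Law
  • In a text of $n$ words with a vocabulary of $V$
    words, the $i$-th most frequent word appears
  • $n / (i^{\theta} H_V(\theta))$ times, where
    $H_V(\theta) = \sum_{j=1}^{V} 1 / j^{\theta}$

[Figure: word frequency F plotted against the words]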
28
Modeling of Natural Language: Heaps' Law
  • The vocabulary of a text of size $n$ words is of
    size $V = K n^{\beta} = O(n^{\beta})$, where $K$ and $0 < \beta < 1$ depend on the text
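
A small numeric sketch of the two laws, assuming their usual statements: under Zipf's law the i-th most frequent word occurs n / (i^theta * H) times, with H the sum of 1 / j^theta over the vocabulary, and under Heaps' law the vocabulary size is K * n^beta. The function names and the sample parameter values are assumptions.

```python
def zipf_frequency(i, n, vocabulary_size, theta):
    """Expected occurrences of the i-th most frequent word (i starts at 1)."""
    harmonic = sum(1 / j ** theta for j in range(1, vocabulary_size + 1))
    return n / (i ** theta * harmonic)

def heaps_vocabulary(n, k, beta):
    """Estimated vocabulary size for a text of n words."""
    return k * n ** beta

# A toy text of 11 words over a 3-word vocabulary, with theta = 1.
print([zipf_frequency(i, 11, 3, 1) for i in (1, 2, 3)])
# Vocabulary growth for illustrative constants K = 10, beta = 0.5.
print(heaps_vocabulary(10_000, 10, 0.5))
```

The Zipf frequencies always sum to n, and the Heaps estimate grows sublinearly with the text size.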

29
Lexical Analysis of the Text
30
Elimination of Stopwords
  • Stopwords
  • Words too frequent among the documents in the
    collection
  • Not good discriminators
  • Articles, prepositions, conjunctions, and some
    verbs, adverbs, and adjectives are natural
    candidates for a list of stopwords
  • Elimination of stopwords
  • Reduces the size of the index structure
    considerably (40% or more is typical)
  • Counterexample: "to be or not to be"

31
Stemming
  • Stem
  • Portion of a word which is left after the removal
    of its affixes (i.e. prefixes and suffixes like
    plurals, gerund forms, and past tense suffixes)
  • Stemming
  • Substitute the words by their respective stems
  • Useful for improving retrieval performance
  • Can reduce the size of index structure
  • Controversy in the literature about the benefits
  • Porter algorithm is often used for suffix
    stripping

32
Noun Groups
  • Most of the semantics is carried by the nouns
  • Selects nouns as index terms through systematic
    elimination of verbs, adjectives, adverbs,
    connectives, articles, and pronouns
  • Common to combine two or three nouns in a single
    component (e.g., computer science)
  • Makes sense to cluster nouns which appear nearby
    into a single indexing component
  • A noun group is a set of nouns with no more than
    3 (or a predetermined threshold) words between
    any two nouns

33
Thesauri
  • Refers to a treasury of words consisting of
  • A precompiled list of important words
  • For each word in the list, a set of related words
  • Complemented with a definition or an explanation
  • Purposes
  • Provide a standard vocabulary for indexing and
    searching
  • Assist users with locating terms for proper
    query formulation
  • Provide classified hierarchies that allow the
    broadening and narrowing of the current query

34
Inverted Files
  • A word-oriented mechanism for indexing a text
    collection in order to speed up the searching
    task
  • Structure
  • Vocabulary
  • Occurrences
  • The space required for the vocabulary is rather
    small, according to Heaps' law
  • The occurrences need extra space

35
Example of an Inverted Index
[Figure: inverted index]
36
Inverted Index using Block Addressing
[Figure: the text "This is a text. A text has many words. Words are made from letters." divided into Block 1 to Block 4, with an inverted index whose occurrences point to blocks]
37
Block Considerations
  • Blocks can be of fixed size
  • Or be defined using the natural division of the
    text collection into files, documents, web pages,
    etc.

38
Effect of Block Sizes
For each collection, the right column considers
that all words are indexed, while the left column
considers that stopwords are not indexed
39
Searching with Inverted Files
  • Vocabulary search
  • Better to have the vocabulary in a separate file
  • Vocabulary file fits in main memory in most cases
  • Retrieval of occurrences
  • Manipulation of occurrences
  • If block addressing is used, it may be necessary
    to directly search the text to find the
    information missing from the occurrences (e.g.,
    exact word position)
  • Sublinear search time and sublinear space
    requirements

40
Constructing a Vocabulary Trie
[Figure: vocabulary trie built on the sample text, with leaves letters: 60; made: 50; many: 28; text: 11, 19; words: 33, 40]
41
Building an Inverted Index
  • Once the text is exhausted, the trie is written
    to disk together with the lists of occurrences
  • Split the index into two files
  • First file: the lists of occurrences are stored
    contiguously
  • Second file: the vocabulary is stored in
    lexicographical order and, for each word, a
    pointer to its list in the first file is also
    included

42
Inverted Index for Large Texts
  • If the index does not fit in main memory, the
    partial index Ii obtained up to now is written to
    disk and erased from main memory before
    continuing with the rest of the text
  • Finally, a number of partial indices Ii exist on
    disk. These indices are then merged in a
    hierarchical manner
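
A minimal sketch of hierarchical merging, assuming each partial index is a dictionary from word to a sorted list of positions and that the partial indices are given in text order, so that concatenating two lists keeps them sorted. The function names are hypothetical, and everything stays in memory here, whereas a real builder would stream the partial indices from disk.

```python
def merge_two(left, right):
    """Merge two partial indices; right must cover text that follows left."""
    merged = {word: list(occ) for word, occ in left.items()}
    for word, occ in right.items():
        merged.setdefault(word, []).extend(occ)
    return dict(sorted(merged.items()))

def merge_hierarchically(partials):
    """Merge partial indices pairwise, level by level, until one remains."""
    level = list(partials)
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level) - 1, 2):
            next_level.append(merge_two(level[i], level[i + 1]))
        if len(level) % 2:          # an odd index out moves up unmerged
            next_level.append(level[-1])
        level = next_level
    return level[0] if level else {}

partials = [{"text": [11]}, {"text": [19], "many": [28]},
            {"words": [33, 40]}, {"made": [50], "letters": [60]}]
print(merge_hierarchically(partials))
```

With eight partial indices this performs seven merges over three levels.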

43
Merging the Partial Indices
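[Figure: binary merge tree. The initial partial indices I-1 to I-8 are merged pairwise in steps 1 to 4 into I-1..2, I-3..4, I-5..6, and I-7..8; steps 5 and 6 produce I-1..4 and I-5..8; step 7 produces the final index I-1..8]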
44
Suffix Trees and Suffix Arrays
  • Queries such as phrases are expensive to solve
    using inverted indices
  • Concept of word does not exist in some
    applications such as genetic databases
  • Suffix trees and suffix arrays are suitable for a
    wider spectrum of applications
  • For word-based applications, inverted files
    perform better unless complex queries are an
    important issue

45
Suffixes
Text: This is a text. A text has many words. Words are made from letters.
Suffixes:
  • text. A text has many words. Words are made from letters.
  • text has many words. Words are made from letters.
  • many words. Words are made from letters.
  • words. Words are made from letters.
  • Words are made from letters.
  • made from letters.
  • letters.
46
Suffix Trie
[Figure: suffix trie for the sample text; its leaves hold the suffix positions 60, 50, 28, 19, 11, 40, and 33]
47
Suffix Tree
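[Figure: suffix tree for the sample text, with internal nodes labeled 3, 1, 5, and 6 and leaves holding the suffix positions 60, 50, 28, 19, 11, 40, and 33]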
48
Suffix Array
[Figure: suffix array for the sample text: 60, 50, 28, 19, 11, 40, 33]
49
Supra-index over Suffix Array
[Figure: supra-index with the sampled entries lett, text, and word over the suffix array 60, 50, 28, 19, 11, 40, 33]
50
Vocabulary Supra-index vs. Inverted List
[Figure: vocabulary supra-index (letters, made, many, text, words) over the suffix array 60, 50, 28, 19, 11, 40, 33, compared with the inverted list 60, 50, 28, 11, 19, 33, 40]
51
Searching Using Suffix Arrays
  • The search pattern originates two limiting
    patterns
  • $P_1$ and $P_2$, so that we want any suffix $S$
    such that $P_1 \le S < P_2$
  • First, binary search both limiting patterns in the
    suffix array
  • All the elements lying between both positions
    point to exactly those suffixes that start like
    the original pattern, i.e., to the pattern
    positions in the text
  • A simple phrase can be searched as if it were a
    simple pattern
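
A minimal sketch of searching with a suffix array, assuming the sample sentence from the slides, the index points 11, 19, 28, 33, 40, 50, 60 (1-based word starts), and case-insensitive comparison. Instead of building two limiting patterns explicitly, the two binary searches compare the first len(pattern) characters of each suffix with the pattern, which delimits the same range. All names are hypothetical.

```python
TEXT = "This is a text. A text has many words. Words are made from letters."
INDEX_POINTS = [11, 19, 28, 33, 40, 50, 60]  # 1-based word starts, from the slides

def suffix_at(text, position):
    # Case is folded so that "Words" and "words" compare equal (an assumption).
    return text[position - 1:].lower()

def build_suffix_array(text, index_points):
    """Index points sorted by the lexicographic order of their suffixes."""
    return sorted(index_points, key=lambda p: suffix_at(text, p))

def search(text, suffix_array, pattern):
    """Return the sorted text positions of suffixes that start with pattern."""
    pattern = pattern.lower()

    def first_index(reached):
        # Binary search for the first slot whose truncated suffix satisfies reached().
        lo, hi = 0, len(suffix_array)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = suffix_at(text, suffix_array[mid])[:len(pattern)]
            if reached(prefix):
                hi = mid
            else:
                lo = mid + 1
        return lo

    start = first_index(lambda prefix: prefix >= pattern)
    end = first_index(lambda prefix: prefix > pattern)
    return sorted(suffix_array[start:end])

array = build_suffix_array(TEXT, INDEX_POINTS)
print(array)
print(search(TEXT, array, "text"), search(TEXT, array, "many words"))
```

The phrase "many words" is handled exactly like a single word, since the comparison simply reads further into the suffix.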

52
Sequential Searching for Exact String Matching
  • Given a short pattern P of length m and a long
    text T of length n
  • Find all the text positions where the pattern
    occurs
  • With no data structure being built on the text
  • Assume that the text and the pattern are
    sequences of characters drawn from an alphabet of
    size s, whose first character is at position 1

53
Brute Force
[Figure: brute-force search for the pattern abracadabra in the text abracabracadabra, trying the pattern at each text position in turn]
Worst case O(mn), average case O(n)
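
A minimal sketch of brute-force matching: the pattern is tried at every text position and compared character by character. The function name is hypothetical, and positions are reported 1-based to match the slides' convention.

```python
def brute_force_search(text, pattern):
    """Return the 1-based positions where pattern occurs in text."""
    n, m = len(text), len(pattern)
    positions = []
    for start in range(n - m + 1):
        j = 0
        while j < m and text[start + j] == pattern[j]:
            j += 1
        if j == m:                      # all m characters matched
            positions.append(start + 1)
    return positions

print(brute_force_search("abracabracadabra", "abracadabra"))
```

The inner loop usually stops after one or two comparisons, which is why the average case is linear even though the worst case is O(mn).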
54
Knuth-Morris-Pratt: the next Function
[Figure: the next function for the pattern abracadabra]
pattern  a b r a c a d a b r a
next     0 0 0 0 1 0 1 0 0 0 0 4
55
Knuth-Morris-Pratt: Example
[Figure: Knuth-Morris-Pratt search for the pattern abracadabra in the text abracabracadabra]
Linear worst case behavior, but no faster
than brute force on average
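
A minimal sketch of Knuth-Morris-Pratt matching. It uses the common textbook failure function (the length of the longest proper border of each pattern prefix), not the refined next function, which additionally requires the characters following the prefix and the suffix to differ. Function names are hypothetical, and positions are 1-based.

```python
def failure_function(pattern):
    """fail[j] = length of the longest proper border of pattern[:j + 1]."""
    fail = [0] * len(pattern)
    k = 0
    for j in range(1, len(pattern)):
        while k > 0 and pattern[j] != pattern[k]:
            k = fail[k - 1]
        if pattern[j] == pattern[k]:
            k += 1
        fail[j] = k
    return fail

def kmp_search(text, pattern):
    """Return the 1-based positions where pattern occurs in text."""
    if not pattern:
        return []
    fail = failure_function(pattern)
    positions = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]             # fall back without re-reading the text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            positions.append(i - k + 2)
            k = fail[k - 1]
    return positions

print(failure_function("abracadabra"))
print(kmp_search("abracabracadabra", "abracadabra"))
```

The text index never moves backwards, which gives the linear worst case.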
56
Boyer-Moore Heuristics
[Figure: the two Boyer-Moore heuristics applied to the pattern abracadabra; in the situation shown, the match heuristic gives a shift of 3 and the occurrence heuristic gives a shift of 5]
Boyer-Moore Example
[Figure: Boyer-Moore search for the pattern abracadabra in the text abracabracadabra]
O(n log(m)/m) on average; worst case is O(mn).
Fastest in general
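
A minimal sketch in the Boyer-Moore family. To stay short it implements the Horspool simplification, which keeps only an occurrence-style shift table keyed on the text character aligned with the end of the pattern; it is not the full Boyer-Moore algorithm with both heuristics. The function name is hypothetical, and positions are 1-based.

```python
def horspool_search(text, pattern):
    """Return the 1-based positions where pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # For each character except the last one of the pattern: distance from
    # its rightmost occurrence to the end of the pattern.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    positions = []
    start = 0
    while start <= n - m:
        j = m - 1
        while j >= 0 and text[start + j] == pattern[j]:
            j -= 1                      # compare right to left
        if j < 0:
            positions.append(start + 1)
        # Characters absent from the table allow a full shift of m.
        start += shift.get(text[start + m - 1], m)
    return positions

print(horspool_search("abracabracadabra", "abracadabra"))
```

Comparing from the right end lets the window jump over text characters without reading them, which is where the sublinear average behavior comes from.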
58
Approximate String Matching
  • Given a short pattern P of length m, a long text
    T of length n, and a maximum allowed number of
    errors k, find all the text positions where the
    pattern occurs with at most k errors

59
Similarity
  • Similarity is measured by a distance function
  • Hamming distance
  • Number of positions that have different
    characters
  • A distance function should be symmetric and
    satisfy the triangle inequality

60
Levenshtein Distance (Edit Distance)
  • Minimum number of character insertions,
    deletions, and replacements to make two strings
    equal
  • Examples
  • distance(color, colour) = 1
  • distance(survey, surgery) = 2
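
A minimal sketch of the two distance functions, assuming unit cost for every insertion, deletion, and replacement. The function names are hypothetical. The Levenshtein version keeps only one previous row of the dynamic programming table.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))

def levenshtein(a, b):
    """Minimum number of insertions, deletions, and replacements."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            current.append(min(previous[j] + 1,          # delete x
                               current[j - 1] + 1,       # insert y
                               previous[j - 1] + cost))  # replace or keep
        previous = current
    return previous[-1]

print(levenshtein("color", "colour"), levenshtein("survey", "surgery"))
```

Both functions are symmetric in their arguments, as a distance function should be.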

61
Dynamic Programming for Approximate String Matching
  • A matrix C[0..m, 0..n] is filled column by
    column, where C[i,j] represents the minimum
    number of errors needed to match P[1..i] to a
    suffix of T[1..j]
  • Computed as
  • C[0,j] = 0
  • C[i,0] = i
  • C[i,j] = if (P[i] = T[j]) then C[i-1,j-1]
  • else 1 + min(C[i-1,j], C[i,j-1],
    C[i-1,j-1])
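
A minimal sketch of the column-by-column computation, assuming that a match is reported at every text position j where the last row satisfies C[m,j] <= k. Only the previous column is kept, so the space used is O(m). The function name is hypothetical.

```python
def approximate_search(text, pattern, k):
    """Return 1-based text positions where a match with <= k errors ends."""
    m = len(pattern)
    column = list(range(m + 1))          # C[i,0] = i
    ends = []
    for j, t in enumerate(text, start=1):
        new_column = [0]                 # C[0,j] = 0: a match may start anywhere
        for i in range(1, m + 1):
            if pattern[i - 1] == t:
                new_column.append(column[i - 1])
            else:
                new_column.append(1 + min(column[i],          # C[i,j-1]
                                          new_column[i - 1],  # C[i-1,j]
                                          column[i - 1]))     # C[i-1,j-1]
        column = new_column
        if column[m] <= k:
            ends.append(j)
    return ends

print(approximate_search("surgery", "survey", 2))
```

Searching for survey in surgery with k = 2 reports the end positions 5, 6, and 7.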

62
Dynamic Programming Example
[Figure: dynamic programming matrix for a text T and a pattern P]
63
Structured Text Retrieval
  • Queries combine patterns with the
    specification of structural components of the
    document
  • Example
  • Same-page( near (atom holocaust, Figure(label
    (earth) ) ) )

64
Non-Overlapping Lists
[Figure: non-overlapping lists over the same text: L0 holds the chapter, L1 the sections, L2 the subsections, and L3 the subsubsections]
65
Non-Overlapping Lists
  • A single inverted file is built in which each
    structural component stands as an entry in the
    index
  • Associated with each entry there is a list of
    text regions as a list of occurrences
  • Such a list could be easily merged with the
    traditional inverted file for the words in the
    text

66
Proximal Nodes
[Figure: hierarchical index of chapter, sections, subsections, and subsubsections, alongside the inverted list for "holocaust" with entries 10, 256, and 48324]
67
Proximal Nodes: Simple Query Processing Strategy
  • Traverse the inverted list for the term
  • For each entry in the list, search the
    hierarchical index looking for chapter, sections,
    subsections, and subsubsections containing that
    occurrence of the term

68
Proximal Nodes: Sophisticated Strategy
  • For the first entry in the list, search the
    hierarchical index as before, until no more
    successful matches occur
  • Verify whether the innermost matching component
    also matches the second entry in the list
  • Proceed then to the third entry in the list, and
    so on

69
Text in Sequence
  • Written text is usually conceived to be read
    sequentially
  • A sequenced organizational structure lies
    underneath most written text
  • Sometimes we are looking for information not
    easily captured through sequential reading
  • Example
  • A book about the history of war organized
    chronologically
  • We want to know regional wars in Europe

70
Hypertext
  • A high level interactive navigational structure
    which allows us to browse text non-sequentially
    on a computer screen
  • Basically a directed graph structure
  • Basis for HTML and HTTP, which originated the
    World Wide Web

71
World Wide Web
  • Can be seen as a very large, unstructured but
    ubiquitous database
  • Triggers the need for efficient tools to manage,
    retrieve, and filter information from it
  • Those tools are also important in large
    intranets, to extract or infer new information to
    support a decision process, a task called data
    mining

72
Searching the Web
  • Forms
  • Use search engines
  • Use Web directories
  • Exploit hyperlink structure
  • Challenges
  • Distributed data
  • High percentage of volatile data
  • Large volume
  • Unstructured and redundant data
  • Quality of data
  • Heterogeneous data

73
Problems Regarding the User and Interaction
  • How to specify a query?
  • How to interpret the answer provided by the
    retrieval system?
  • How do we handle a large answer?
  • How do we rank the documents?
  • How do we select the documents?
  • How do we browse efficiently in large documents?

74
Measuring the Web (1999)
  • There are more than 40 million computers in more
    than 200 countries connected to the Internet
  • Estimated number of Web servers ranges from 2.4
    million to over three million
  • Estimated number of Web pages ranges from 200 to
    320 million, growing at a rate of 20 million
    pages per month
  • Estimated that the 30,000 largest Web sites (about
    1% of the Web) account for approximately 50% of
    all Web pages

75
Measuring the Web (1999)
  • An average page has between 5 and 15 hyperlinks,
    and most of them are local
  • Most Web pages are HTML pages
  • Assuming that the average HTML page is 5 KB and
    that there are 300 million Web pages, we have at
    least 1.5 terabytes of text
  • Total number of languages exceeds 100

76
Languages of the Web
77
Models of the Web
  • Heaps' and Zipf's laws are also valid in the Web
  • Probability of finding a document of size x bytes
  • 93% of all the files have a size below 9.3 KB

78
Distribution of All File Size (1998)
79
Right Tail Distribution for Different File Types
(1996)
80
Search Engines
  • In the Web, all queries must be answered without
    accessing the text. That is, only the indices
    are available.
  • Otherwise, we would have to either
  • Store a local copy of the Web pages (too
    expensive)
  • Access remote pages through the network at query
    time (too slow)

81
Search Engine: Centralized Architecture
  • Crawlers are programs (software agents) that
    traverse the web sending new or updated pages to
    a main server where they are indexed.
  • Crawlers are also called robots, spiders,
    wanderers, walkers, and knowbots
  • A crawler does not actually move to and run on
    remote machines
  • The index is used in a centralized fashion to
    answer queries submitted from different places in
    the Web

82
Search Engine: Centralized Architecture
[Figure: centralized architecture with the components Users, Interface, Query Engine, Index, Indexer, Crawler, and Web]
83
Search Engine: Centralized Architecture
  • Main problems
  • Gathering of the data (highly dynamic)
  • Saturated communication links
  • High load at web servers
  • Volume of the data
  • May not be able to cope with Web growth in the
    near future
  • Good load balancing internally (answering queries
    and indexing) and externally (crawling) is
    important

84
Page Ranking
  • Most search engines use variations of the Boolean
    or vector model
  • To be performed without accessing the text, just
    the index
  • The vector model yields a better recall-precision
    curve, with an average precision of 75% in a
    study
  • Some new algorithms also use hyperlink
    information and achieve even better results

85
Crawling the Web
  • Starts with a set of URLs and from there extracts
    other URLs, which are followed recursively in a
    breadth-first or depth-first fashion
  • Allows users to submit top Web sites that will be
    added to the URL set
  • Or starts with a set of popular URLs
  • Difficult to coordinate several crawlers to avoid
    visiting the same page more than once
  • Or partitions the Web using country codes or
    Internet names
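
A minimal sketch of breadth-first crawling that never visits the same page twice. To keep it runnable without network access, the Web is stood in for by a small in-memory link table; the URLs, the get_links callback, and the max_pages limit are all hypothetical.

```python
from collections import deque

def crawl_breadth_first(seed_urls, get_links, max_pages=100):
    """Visit pages breadth-first, never visiting the same URL twice."""
    seeds = list(dict.fromkeys(seed_urls))   # drop duplicate seeds, keep order
    seen = set(seeds)
    queue = deque(seeds)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)                  # a real crawler would fetch and index here
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

# Toy link graph standing in for the Web (hypothetical URLs).
LINKS = {
    "a.example/": ["a.example/1", "b.example/"],
    "a.example/1": ["a.example/"],
    "b.example/": ["a.example/1", "c.example/"],
    "c.example/": [],
}
print(crawl_breadth_first(["a.example/"], lambda url: LINKS.get(url, [])))
```

Replacing popleft() with pop() turns the queue into a stack and the traversal into a depth-first one. Coordinating several crawlers would require sharing the seen set or partitioning the URL space between them.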

86
Indices
  • Dynamically generated pages cannot be indexed,
    nor can password-protected pages
  • Most indices use variants of the inverted file
  • Some use elimination of stopwords to reduce the
    size of the index
  • The index is complemented with a short
    description of each Web page
  • A query is answered by doing a binary search on
    the sorted list of words of the inverted file
  • Block addressing is used by some search engines

87
Web Directories
  • Serve as a browsing tool; Yahoo! is an example
  • Also called catalogs, yellow pages, or subject
    directories
  • In most cases, pages have to be submitted to the
    Web directory, where they are reviewed and
    classified
  • Classification is often done manually
  • Can afford to have a copy of all classified pages
  • Most also send the query to a search engine

88
Metasearchers
  • Web servers that send a given query to several
    search engines, Web directories and other
    databases, collect the answers and unify them
  • Examples include Metacrawler and SavvySearch
  • They differ in how ranking is performed in the
    unified result
  • Metasearchers for specific topics can be
    considered as software agents

89
Dynamic Search
  • Use an online search to discover relevant
    information by following links
  • Slow, but might be used in small and dynamic
    subsets of the web
  • Fish search
  • Exploit the intuition that relevant documents
    often have neighbors that are relevant
  • At each step, the page with highest priority is
    analyzed. If relevant, a heuristic decides to
    follow or not to follow the links on that page

90
Software Agents
  • For searching specific information on the Web
  • Deals with heterogeneous sources of information
    which have to be combined
  • Important issues
  • How to determine relevant sources
  • How to merge the results retrieved (the fusion
    problem)