Information Retrieval - PowerPoint PPT Presentation

Provided by: bert193
Learn more at: https://s2.smu.edu
Transcript and Presenter's Notes

Title: Information Retrieval


1
Information Retrieval
  • Introduction/Overview
  • Material for these slides obtained from:
    • Modern Information Retrieval by Ricardo
      Baeza-Yates and Berthier Ribeiro-Neto
      http://www.sims.berkeley.edu/hearst/irbook/
    • Data Mining Introductory and Advanced Topics by
      Margaret H. Dunham
      http://www.engr.smu.edu/mhd/book

2
Information Retrieval
  • Information Retrieval (IR): retrieving desired
    information from textual data.
  • Library Science
  • Digital Libraries
  • Web Search Engines
  • Traditionally keyword based
  • Sample query:
    • Find all documents about data mining.

3
DB vs IR
  • Records (tuples) vs. documents
  • Well defined results vs. fuzzy results
  • DB grew out of files and traditional business
    systems
  • IR grew out of library science and need to
    categorize/group/access books/articles

4
DB vs IR (cont'd)
  • Data retrieval
    • which docs contain a set of keywords?
    • well defined semantics
    • a single erroneous object implies failure!
  • Information retrieval
    • information about a subject or topic
    • semantics is frequently loose
    • small errors are tolerated
  • IR system
    • interpret contents of information items
    • generate a ranking which reflects relevance
    • notion of relevance is most important

5
Motivation
  • IR in the last 20 years
    • classification and categorization
    • systems and languages
    • user interfaces and visualization
  • Still, the area was seen as being of narrow
    interest
  • Advent of the Web changed this perception once
    and for all
    • universal repository of knowledge
    • free (low cost) universal access
    • no central editorial board
    • many problems, though: IR is seen as key to
      finding the solutions!

6
Basic Concepts
Logical view of the documents
  • Document representation viewed as a continuum:
    the logical view of docs might shift

7
The Retrieval Process

8
IR is Fuzzy
[Figure: Accept and Reject regions under Simple vs. Fuzzy matching]

9
Information Retrieval
  • Similarity: measure of how close a query is to a
    document.
  • Documents which are close enough are retrieved.
  • Metrics:
    • Precision = |Relevant and Retrieved| / |Retrieved|
    • Recall = |Relevant and Retrieved| / |Relevant|
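
The two metrics above can be computed directly from sets of document ids. The following is a minimal sketch; the function name and the sample ids are hypothetical and not taken from the slides.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two collections of document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved  # documents that are relevant AND retrieved
    # Guard against empty sets to avoid division by zero.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ids: 4 relevant docs, 5 retrieved docs, 3 in common.
p, r = precision_recall({1, 2, 3, 4}, {2, 3, 4, 8, 9})
print(p, r)
```

Retrieving more documents tends to raise recall while lowering precision, which is why the two are usually reported together.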

10
Indexing
  • IR systems usually adopt index terms to process
    queries
  • Index term:
    • a keyword or group of selected words
    • any word (more general)
  • Stemming might be used:
    • connect: connecting, connection, connections
  • An inverted file is built for the chosen index
    terms
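
To illustrate what stemming does to the "connect" family mentioned above, here is a deliberately naive suffix stripper. It is only a sketch: the suffix list is an assumption chosen for this example, and it is not the Porter algorithm or any stemmer the slides refer to.

```python
# Assumed toy suffix list, ordered so that longer suffixes are tried first.
SUFFIXES = ("ions", "ion", "ing", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["connect", "connecting", "connection", "connections"]
stems = {w: naive_stem(w) for w in words}
print(stems)
```

All four variants map to the same index term, so a query for any one of them can match documents containing the others.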

11
Indexing
[Figure: Docs → Index Terms → doc; Information Need → query; doc and query are matched to produce a Ranking]

12
Inverted Files
  • There are two main elements:
    • Vocabulary: set of unique terms
    • Occurrences: where those terms appear
  • The occurrences can be recorded as term (word)
    offsets or byte offsets
  • Term offsets are well suited to queries such as
    proximity, whereas byte offsets allow direct
    access to the text

[Figure: Vocabulary and Occurrences (byte offset)]

13
Inverted Files
  • The vocabulary of indexed terms is often several
    orders of magnitude smaller than the documents
    themselves (MBs vs. GBs)
  • The space consumed by the occurrence lists is not
    trivial: each time a term appears, the occurrence
    must be added to that term's list in the inverted
    file
  • That may lead to considerable index overhead

14
Example
  • Text (the offsets mark where each word starts)
    offsets: 1, 6, 12, 16, 18, 25, 29, 36, 40, 45, 54, 58, 66, 70
    That house has a garden. The garden has many
    flowers. The flowers are beautiful
  • Inverted file

    Vocabulary   Occurrences
    beautiful    70
    flowers      45, 58
    garden       18, 29
    house        6

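
An inverted file like the one in this example can be built in a few lines. The sketch below makes several assumptions: the stop-word list is invented so that only the four vocabulary words survive, and offsets are 0-based character offsets, so some values may differ by one from the figures on the slide.

```python
import re
from collections import defaultdict

# Assumed stop list: everything in the sample text except the four
# vocabulary words shown on the slide.
STOP_WORDS = {"that", "has", "a", "the", "many", "are"}


def build_inverted_file(text):
    """Map each non-stop word to the list of offsets where it starts."""
    index = defaultdict(list)
    for match in re.finditer(r"[A-Za-z]+", text):
        term = match.group().lower()
        if term not in STOP_WORDS:
            index[term].append(match.start())  # 0-based offset
    return dict(sorted(index.items()))


text = ("That house has a garden. The garden has many flowers. "
        "The flowers are beautiful")
inverted = build_inverted_file(text)
for term, occurrences in inverted.items():
    print(term, occurrences)
```

Storing word positions instead of character offsets would only require replacing `match.start()` with a running word counter.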
15
Ranking
  • A ranking is an ordering of the documents
    retrieved that (hopefully) reflects the relevance
    of the documents to the query
  • A ranking is based on fundamental premises
    regarding the notion of relevance, such as:
    • common sets of index terms
    • sharing of weighted terms
    • likelihood of relevance
  • Each set of premises leads to a distinct IR model

16
Classic IR Models - Basic Concepts
  • Each document is represented by a set of
    representative keywords or index terms
  • An index term is a document word useful for
    remembering the document's main themes
  • Usually, index terms are nouns because nouns have
    meaning by themselves
  • However, search engines assume that all words are
    index terms (full text representation)

17
Classic IR Models - Basic Concepts
  • The importance of the index terms is represented
    by weights associated with them
    • ki: an index term
    • dj: a document
    • wij: a weight associated with (ki,dj)
  • The weight wij quantifies the importance of the
    index term for describing the document contents

18
Classic IR Models - Basic Concepts
  • t is the total number of index terms
  • K = {k1, k2, ..., kt} is the set of all index
    terms
  • wij > 0 is a weight associated with (ki,dj)
  • wij = 0 indicates that the term does not belong
    to the doc
  • dj = (w1j, w2j, ..., wtj) is a weighted vector
    associated with the document dj
  • gi(dj) = wij is a function which returns the
    weight associated with the pair (ki,dj)

19
The Boolean Model
  • Simple model based on set theory
  • Queries specified as boolean expressions
  • precise semantics and neat formalism
  • Terms are either present or absent. Thus,
    wij ∈ {0,1}
  • Consider
    • q = ka ∧ (kb ∨ ¬kc)
    • qdnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)
    • qcc = (1,1,0) is a conjunctive component
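
The disjunctive normal form of the example query can be derived mechanically by enumerating every binary assignment of (ka, kb, kc). This sketch assumes the query reads q = ka AND (kb OR NOT kc); the helper names are hypothetical.

```python
from itertools import product


def query(ka, kb, kc):
    """The example query: ka AND (kb OR NOT kc)."""
    return bool(ka and (kb or not kc))


# Every (ka, kb, kc) assignment that satisfies the query is one
# conjunctive component of the disjunctive normal form.
dnf = [bits for bits in product((1, 0), repeat=3) if query(*bits)]
print(dnf)


def matches(doc_terms):
    """A document matches if its binary term vector is a conjunctive component."""
    vector = tuple(int(k in doc_terms) for k in ("ka", "kb", "kc"))
    return vector in dnf


print(matches({"ka", "kb"}), matches({"ka", "kc"}))
```

Because every weight is 0 or 1, a document either matches a component exactly or does not match at all; there is no notion of a partial match.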

20
The Vector Model
  • Use of binary weights is too limiting
  • Non-binary weights provide consideration for
    partial matches
  • These term weights are used to compute a degree
    of similarity between a query and each document
  • Ranked set of documents provides for better
    matching

21
The Vector Model
  • wij > 0 whenever ki appears in dj
  • wiq > 0 associated with the pair (ki,q)
  • dj = (w1j, w2j, ..., wtj)
  • q = (w1q, w2q, ..., wtq)
  • Each term ki is associated with a unit vector
    vec(i)
  • The unit vectors vec(i) and vec(j) are assumed to
    be orthonormal (i.e., index terms are assumed to
    occur independently within the documents)
  • The t unit vectors vec(i) form an orthonormal
    basis for a t-dimensional space in which queries
    and documents are represented as weighted vectors
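
The slides do not name a specific similarity function at this point. A common choice in the vector model is the cosine of the angle between the document and query vectors, which is assumed here. The weights below are made-up values for a three-term vocabulary.

```python
import math


def cosine_similarity(d, q):
    """Cosine of the angle between two weight vectors of equal length."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # an all-zero vector is similar to nothing
    return dot / (norm_d * norm_q)


# Hypothetical weights over t = 3 index terms.
q = (1.0, 0.0, 1.0)
docs = {
    "d1": (2.0, 0.0, 1.0),
    "d2": (0.0, 3.0, 0.0),
    "d3": (1.0, 1.0, 1.0),
}
ranking = sorted(docs, key=lambda name: cosine_similarity(docs[name], q),
                 reverse=True)
print(ranking)
```

Unlike the Boolean model, d3 is still ranked above d2 even though it only partially matches the query.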

22
Query Languages
  • Keyword Based
  • Boolean
  • Weighted Boolean
  • Context Based (Phrasal Proximity)
  • Pattern Matching
  • Structural Queries

23
Keyword Based Queries
  • Basic Queries
  • Single word
  • Multiple words
  • Context Queries
  • Phrase
  • Proximity

24
Boolean Queries
  • Keywords combined with Boolean operators
    • OR: (e1 OR e2)
    • AND: (e1 AND e2)
    • BUT: (e1 BUT e2), satisfy e1 but not e2
  • Negation is only allowed through BUT, so that the
    inverted index can be used efficiently: the
    negated set only filters another efficiently
    retrievable set.
  • Naïve users have trouble with Boolean logic.

25
Boolean Retrieval with Inverted Indices
  • Primitive keyword: retrieve the documents
    containing it using the inverted index.
  • OR: recursively retrieve e1 and e2 and take the
    union of the results.
  • AND: recursively retrieve e1 and e2 and take the
    intersection of the results.
  • BUT: recursively retrieve e1 and e2 and take the
    set difference of the results.
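
The recursive evaluation described above maps directly onto set operations. In this sketch, queries are written as nested tuples such as ("AND", e1, e2); that representation and the tiny index are assumptions made for illustration.

```python
# Hypothetical inverted index: term -> set of document ids.
INDEX = {
    "data": {1, 2, 4},
    "mining": {2, 4, 5},
    "web": {3, 4},
}


def retrieve(expr):
    """Evaluate a keyword or an (operator, e1, e2) tuple against INDEX."""
    if isinstance(expr, str):          # primitive keyword
        return INDEX.get(expr, set())
    op, e1, e2 = expr
    left, right = retrieve(e1), retrieve(e2)
    if op == "OR":
        return left | right            # union
    if op == "AND":
        return left & right            # intersection
    if op == "BUT":
        return left - right            # set difference
    raise ValueError(f"unknown operator: {op}")


result = retrieve(("BUT", ("AND", "data", "mining"), "web"))
print(result)
```

Note that BUT never needs the complement of a set: it only removes documents from a result that was already retrieved through the index.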

26
Phrasal Queries
  • Retrieve documents with a specific phrase
    (ordered list of contiguous words)
    • "information theory"
  • May allow intervening stop words and/or stemming.
    • "buy camera" matches
      "buy a camera",
      "buying the cameras",
      etc.

27
Phrasal Retrieval with Inverted Indices
  • Must have an inverted index that also stores
    positions of each keyword in a document.
  • Retrieve documents and positions for each
    individual word, intersect documents, and then
    finally check for ordered contiguity of keyword
    positions.
  • Best to start contiguity check with the least
    common word in the phrase.
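
The three steps above (look up positions, intersect documents, check contiguity starting from the least common word) can be sketched as follows. The positional index layout, term -> {doc id -> word positions}, and the sample documents are assumptions; stop words and stemming are ignored.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each word to {doc_id: [word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index


def phrase_search(index, phrase):
    """Return ids of documents containing the words of phrase contiguously."""
    words = phrase.lower().split()
    if not words or any(w not in index for w in words):
        return set()
    # Intersect the document lists of all words in the phrase.
    candidates = set.intersection(*(set(index[w]) for w in words))
    # Start the contiguity check from the word found in the fewest documents.
    rarest = min(range(len(words)), key=lambda i: len(index[words[i]]))
    hits = set()
    for doc_id in candidates:
        for pos in index[words[rarest]][doc_id]:
            start = pos - rarest  # where the phrase would have to begin
            if all(start + i in index[w][doc_id] for i, w in enumerate(words)):
                hits.add(doc_id)
                break
    return hits


docs = {
    1: "information theory is fun",
    2: "theory of information",
    3: "an information theory course",
}
index = build_positional_index(docs)
print(phrase_search(index, "information theory"))
```

Document 2 contains both words but in the wrong order, so it survives the intersection step and is only rejected by the contiguity check.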

28
Proximity Queries
  • List of words with specific maximal distance
    constraints between terms.
  • Example: "dogs" and "race" within 4 words
    matches "dogs will begin the race"
  • May also perform stemming and/or not count stop
    words.
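
A proximity check reduces to comparing word positions. This sketch assumes distance is the difference between word positions, that the two words may appear in either order, and that every word (including stop words) counts toward the distance.

```python
def within(text, w1, w2, max_distance):
    """True if w1 and w2 occur at most max_distance word positions apart."""
    words = text.lower().split()
    pos1 = [i for i, w in enumerate(words) if w == w1]
    pos2 = [i for i, w in enumerate(words) if w == w2]
    return any(abs(i - j) <= max_distance for i in pos1 for j in pos2)


sentence = "dogs will begin the race"
print(within(sentence, "dogs", "race", 4))
print(within(sentence, "dogs", "race", 3))
```

To skip stop words, filter them out of `words` before computing positions, which shortens the measured distance accordingly.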

29
Pattern Matching
  • Allow queries that match strings rather than word
    tokens.
  • Requires more sophisticated data structures and
    algorithms than inverted indices to retrieve
    efficiently.

30
Simple Patterns
  • Prefixes: pattern that matches the start of a word.
    • "anti" matches "antiquity", "antibody", etc.
  • Suffixes: pattern that matches the end of a word.
    • "ix" matches "fix", "matrix", etc.
  • Substrings: pattern that matches an arbitrary
    subsequence of characters.
    • "rapt" matches "enrapture", "velociraptor", etc.
  • Ranges: pair of strings that matches any word
    lexicographically (alphabetically) between them.
    • "tin" to "tix" matches "tip", "tire", "title",
      etc.
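
The four simple patterns can be tried out on a small sorted vocabulary. This is only a sketch: the vocabulary is assembled from the examples above, the range is treated as inclusive, and the suffix and substring matches use a linear scan, whereas an efficient system would need richer structures such as tries or suffix trees.

```python
from bisect import bisect_left, bisect_right

# Sorted vocabulary built from the example words on this slide.
VOCAB = sorted(["antibody", "antiquity", "enrapture", "fix", "matrix",
                "tin", "tip", "tire", "title", "tix", "velociraptor"])


def prefix_match(prefix):
    """Binary search for the contiguous run of words starting with prefix."""
    lo = bisect_left(VOCAB, prefix)
    hi = bisect_left(VOCAB, prefix + "\uffff")  # just past the last match
    return VOCAB[lo:hi]


def suffix_match(suffix):
    """Linear scan: words ending with suffix."""
    return [w for w in VOCAB if w.endswith(suffix)]


def substring_match(sub):
    """Linear scan: words containing sub anywhere."""
    return [w for w in VOCAB if sub in w]


def range_match(low, high):
    """Words lexicographically between low and high (inclusive, by assumption)."""
    return VOCAB[bisect_left(VOCAB, low):bisect_right(VOCAB, high)]


print(prefix_match("anti"))
print(suffix_match("ix"))
print(substring_match("rapt"))
print(range_match("tin", "tix"))
```

Prefix and range queries are cheap because matching words are adjacent in sorted order; suffix and substring queries have no such locality in a plain sorted list.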