INFO624 Week 2 Models of Information Retrieval - PowerPoint PPT Presentation


1
INFO624 - Week 2: Models of Information Retrieval
  • Dr. Xia Lin
  • Associate Professor
  • College of Information Science and Technology
  • Drexel University

2
Review of Last Week
  • Challenges of Information Retrieval
  • Translate users' information needs into queries.
  • Match queries to stored information.
  • Evaluate whether the query results match the users'
    information needs.
  • Differences between
  • Data, information, and knowledge
  • Data retrieval and information retrieval

3
Assignment 1
  • Some of my favorite Search Software Packages
  • IBM Content Management (high-cost)
  • AOL PLS Search Engine (free)
  • GreenStone Digital Library Software (open-source)
  • SWISH (open source)
  • mnoGoSearch (free)
  • Apache Lucene (open source components)

4
Documents
  • Documents are logical units of text
  • Units of records (text + other components)
  • Units that can be stored, retrieved, and
    displayed as a unique entity
  • Units of semantic entity
  • units of text grouped together for a purpose
  • Units of unformatted text
  • Text as written by authors of documents.

5
Document Models
  • Documents need to be processed and represented in
    concise and identifiable formats/structures.
  • Documents are full of text.
  • Not every word of the text is meaningful for
    searching/retrieval.
  • Documents themselves do not have identifiable
    attributes such as authors and titles.

6
Figure 1.2: Logical view of a document, from full
text to a set of index terms.

7
Document Representation
  • Documents should be represented to help users
    identify and receive information from the system.
  • to identify authors and titles
  • to identify subjects
  • to provide summaries/abstracts
  • to classify subject categories

8
Document Surrogates
  • Each document should have one or more short and
    descriptive labels/attributes
  • Level 1
  • Title
  • Author
  • Keywords
  • Level 2
  • Level 1 + Abstract
  • Level 3
  • Level 2 + full text

9
A Formal IR Model
  • An information retrieval model is a quadruple (D,
    Q, F, R(qi, dj)) where
  • D is a set composed of logical views (or
    representations) for the documents in the
    collection.
  • Q is a set composed of logical views (or
    representations) for the information needs. Such
    representations are called queries.
  • F is a framework for modeling document
    representations, queries, and their relationships.
  • R(qi, dj) is a ranking function that associates
    a real number with a query qi and a document
    representation dj. Such a ranking defines an
    ordering among the documents with regard to the
    query qi.

10
Computerized Indexing
  • Title indexing
  • Sort all the titles alphabetically
  • Ignore a leading "a" or "the"
  • Convert all letters to uppercase.
  • Matching always starts from the beginning of the
    title (not individual words).
  • Most early IR systems (such as library catalogs)
    used title indexing
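
The title-indexing rules above can be sketched in a few lines of Python. This is a minimal illustration, not a reconstruction of any particular catalog system: the function names and the sample titles are hypothetical, and only the two articles named on the slide are stripped.

```python
import bisect

# The slide names only "a" and "the" as leading words to ignore.
LEADING_ARTICLES = ("A ", "THE ")

def title_key(title):
    """Normalize a title: uppercase it and drop one leading article."""
    key = title.strip().upper()
    for article in LEADING_ARTICLES:
        if key.startswith(article):
            return key[len(article):]
    return key

def build_title_index(titles):
    """Return (key, original title) pairs sorted alphabetically by key."""
    return sorted((title_key(t), t) for t in titles)

def search_titles(index, prefix):
    """Return titles whose normalized form starts with the given prefix."""
    wanted = title_key(prefix)
    keys = [key for key, _ in index]
    start = bisect.bisect_left(keys, wanted)
    hits = []
    for key, title in index[start:]:
        if not key.startswith(wanted):
            break
        hits.append(title)
    return hits

# Hypothetical titles, for illustration only.
titles = [
    "The Art of Indexing",
    "A Guide to Retrieval",
    "Art History",
    "Modern Retrieval",
]
index = build_title_index(titles)
print(search_titles(index, "art"))        # matches from the start of the title
print(search_titles(index, "retrieval"))  # a word inside a title is not found
```

The second search returns nothing even though two titles contain "Retrieval", which is the limitation that word indexing addresses.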

11
Word indexing
  • Parse every individual word from the documents
  • First decision: What is a word?
  • Are digits words?
  • How about letter-and-digit combinations such as B6 or
    B12?
  • Is F-16 one word or two words?
  • Hyphens
  • Online, on-line, on line?
  • F-16
  • Singular or plural?
  • List all the words alphabetically with pointers
    back to documents: inverted indexing.

12
Inverted Indexing
  • Inverted indexing consists of an ordered list of
    indexing terms; each indexing term is associated
    with some document identification numbers.
  • Retrieval is done by first searching the
    ordered list to find the indexing term, then
    using the document identification numbers to
    locate the documents.
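
As a rough sketch of the structure described above, the following Python builds an inverted index as an alphabetically ordered mapping from each term to the document IDs that contain it. The three sample documents and the tokenizer are assumptions made for illustration; the tokenizer is just one possible answer to the "what is a word?" question (it would split "F-16" into "f" and "16").

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(docs):
    """Map each indexing term to the sorted list of document IDs containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    # Sorting the items gives the ordered term list the slide describes.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

# Hypothetical documents, for illustration only.
docs = {
    1: "Information retrieval systems",
    2: "Retrieval of stored information",
    3: "Database systems",
}
index = build_inverted_index(docs)
for term, doc_ids in index.items():
    print(term, doc_ids)

# Retrieval: look the term up, then use the IDs to locate the documents.
for doc_id in index.get("retrieval", []):
    print(doc_id, docs[doc_id])
```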

13
Example: Create an inverted index for the
following

14
Boolean Logic
  • Logical operators defined on sets
  • True and false
  • A set is a collection of items with certain
    common characteristics.
  • Any item either belongs to the set (true) or does not
    belong to the set (false)
  • AND
  • combine two sets, A and B, to create a smaller
    (or at least not larger) set C.
  • any items in C must be in BOTH set A and set B.
  • OR
  • Union of two sets, A and B, to create a larger
    set C.
  • any item in C must be either in set A or in set
    B.
  • NOT
  • to exclude items in a set.

15
Example
  • Given
  • A = {1, 3, 7, 12, 14, 25, 36}
  • B = {1, 2, 3, 4, 5, 7, 8, 12, 13, 14, 15, 25, 26}
  • C = {2, 4, 6, 8, 10, 11, 12, 13, 14}
  • Derive
  • A AND B
  • A OR B
  • A AND B AND C
  • (A AND B) NOT C
  • (A AND B) OR C
  • (A OR B) AND C
  • A AND (B OR C)
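
The derivations listed above can be checked with Python's built-in set type. The sketch below assumes AND maps to intersection (`&`), OR to union (`|`), and NOT to set difference (`-`); the variable names are chosen for illustration.

```python
# The three sets from the example.
A = {1, 3, 7, 12, 14, 25, 36}
B = {1, 2, 3, 4, 5, 7, 8, 12, 13, 14, 15, 25, 26}
C = {2, 4, 6, 8, 10, 11, 12, 13, 14}

results = {
    "A AND B": A & B,
    "A OR B": A | B,
    "A AND B AND C": A & B & C,
    "(A AND B) NOT C": (A & B) - C,
    "(A AND B) OR C": (A & B) | C,
    "(A OR B) AND C": (A | B) & C,
    "A AND (B OR C)": A & (B | C),
}

for expression, items in results.items():
    print(f"{expression}: {sorted(items)}")
```

Note how each AND shrinks the result (A AND B AND C leaves only 12 and 14), while each OR can only keep it the same size or grow it.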

16
Boolean Logic
  • Venn Diagram
  • graphical representation of Boolean logic
  • A and (B or C)
  • A and B or (C and D)

17
Boolean Query
  • Terms connected by Boolean operators
  • The system retrieves a set of documents based on
    the Boolean logic of the query.
  • Examples
  • (network or networks or structured or system or
    systems) and (information or retrieval)

18
Advantages of Boolean Search
  • Simple and specific
  • Effective
  • AND reduces the number of hits very quickly
  • OR expands search scope
  • Strongly logic-based
  • proven mathematical foundations

19
Problems of Boolean Search
  • Boolean search is an exact search
  • either retrieving or not retrieving a document.
  • Requesting "computer" would not find "computing"
    unless more programming is done
  • No weighting can be done on terms
  • in the query A AND B, you can't specify that A is more
    important than B.

20
  • No Ranking
  • Retrieved sets cannot be ordered based on the
    Boolean logic.
  • Every retrieved document is treated equally.
  • Possible order confusion
  • A AND B OR C
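
To see why the order matters, the sketch below evaluates both readings of "A AND B OR C" on the sets from the earlier Boolean example. The sets are copied from that slide; the variable names are illustrative.

```python
A = {1, 3, 7, 12, 14, 25, 36}
B = {1, 2, 3, 4, 5, 7, 8, 12, 13, 14, 15, 25, 26}
C = {2, 4, 6, 8, 10, 11, 12, 13, 14}

and_first = (A & B) | C   # read as (A AND B) OR C
or_first = A & (B | C)    # read as A AND (B OR C)

print(sorted(and_first))
print(sorted(or_first))
print(and_first == or_first)
```

The two readings retrieve different sets, so a system has to fix a precedence rule or require parentheses. Python itself binds `&` more tightly than `|`, so an unparenthesized `A & B | C` follows the first reading.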

21
Vectors
  • A numerical representation for a point in a
    multi-dimensional space.
  • (x1, x2, ..., xn)
  • Dimensions of the space need to be defined
  • A measure of the space needs to be defined.

22
Vector Representation of Document Space
  • Each indexing term is a dimension
  • Each document is a vector
  • Di = (ti1, ti2, ti3, ti4, ..., tin)
  • Dj = (tj1, tj2, tj3, tj4, ..., tjn)
  • Document similarity is defined as
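
The formula itself is not preserved in this transcript. The similarity values in the worked example on the following slides (for instance 0.71, 1, and 0.82 for the query (1, 1, 0)) are consistent with the cosine measure, so, as an assumption, the definition can be written as:

```latex
S(D_i, D_j) =
  \frac{\sum_{k=1}^{n} t_{ik}\, t_{jk}}
       {\sqrt{\sum_{k=1}^{n} t_{ik}^{2}}\;\sqrt{\sum_{k=1}^{n} t_{jk}^{2}}}
```

Here t_{ik} is the k-th component of D_i. The value is 1 when two vectors point in the same direction and 0 when they share no terms.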

23
Example
  • A document space is defined by three terms
  • hardware, software, user
  • A set of documents is defined as
  • A1 = (1, 0, 0), A2 = (0, 1, 0), A3 = (0, 0, 1)
  • A4 = (1, 1, 0), A5 = (1, 0, 1), A6 = (0, 1, 1)
  • A7 = (1, 1, 1), A8 = (1, 0, 1), A9 = (0, 1, 1)
  • If the query is "hardware and software",
  • what documents should be retrieved?

24
  • In Boolean query matching
  • documents A4 and A7 will be retrieved by ANDing the
    two query terms
  • A1, A2, A4, A5, A6, A7, A8, A9 will be retrieved if the two
    query terms are ORed together.
  • In Vector query matching
  • q = (1, 1, 0)
  • S(q, A1) = 0.71, S(q, A2) = 0.71, S(q, A3) = 0
  • S(q, A4) = 1, S(q, A5) = 0.5, S(q, A6) = 0.5
  • S(q, A7) = 0.82, S(q, A8) = 0.5, S(q, A9) = 0.5
  • Document retrieved set (with order)
  • A4, A7, A1, A2, A5, A6, A8, A9
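
The similarity values and the ranked list above can be reproduced with a short script. It assumes S is the cosine of the angle between the query and document vectors, which matches every value on the slide; the function and variable names are illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Dimensions: (hardware, software, user), as in the example.
docs = {
    "A1": (1, 0, 0), "A2": (0, 1, 0), "A3": (0, 0, 1),
    "A4": (1, 1, 0), "A5": (1, 0, 1), "A6": (0, 1, 1),
    "A7": (1, 1, 1), "A8": (1, 0, 1), "A9": (0, 1, 1),
}
q = (1, 1, 0)

scores = {name: cosine(q, vec) for name, vec in docs.items()}

# Keep documents with a non-zero score, highest first; ties keep their original order.
ranked = sorted((n for n in scores if scores[n] > 0), key=lambda n: -scores[n])

for name in ranked:
    print(name, round(scores[name], 2))
```

A3 shares no term with the query, scores 0, and drops out of the retrieved set.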

25
Weights in the Vector Space
  • A main advantage of vector representation is that
    items in vectors don't have to be just 0 or 1
    (true or false).
  • A1 = (0.7, 0.5, 0.3)
  • A2 = (0.5, 0.2, 0.7)
  • A3 = (0.3, 0.6, 0.9)
  • A4 = (0.7, 0.9, 1.0)
  • Queries may also be weighted
  • Q = (0.7, 0.3, 0)
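
The slide does not rank these weighted documents. As an illustration, assuming the same cosine measure used in the binary example, the weighted query and documents can be compared like this:

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Weighted document vectors and query from the slide.
docs = {
    "A1": (0.7, 0.5, 0.3),
    "A2": (0.5, 0.2, 0.7),
    "A3": (0.3, 0.6, 0.9),
    "A4": (0.7, 0.9, 1.0),
}
Q = (0.7, 0.3, 0)

scores = {name: cosine(Q, vec) for name, vec in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(name, round(scores[name], 3))
```

Under this assumption A1 ranks first: its weights are concentrated on the first dimension, which is also where the query places most of its weight.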

26
TF and IDF
  • TF: term frequency
  • number of times a term occurs in a document
  • DF: document frequency
  • number of documents that contain the term
  • IDF: inverse document frequency
  • log(N/ni)
  • N: the total number of documents
  • ni: the number of documents that contain term i
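
A small sketch of how these three quantities could be computed. The four-document corpus is made up for illustration, and the natural logarithm is an assumption, since the slide does not state a base.

```python
import math
from collections import Counter

# Hypothetical corpus: document ID -> whitespace-separated terms.
docs = {
    "d1": "hardware software hardware user",
    "d2": "software user",
    "d3": "user user user",
    "d4": "hardware",
}

# TF: number of times each term occurs in each document.
tf = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}

# DF: number of documents that contain each term (each document counts once).
df = Counter(term for counts in tf.values() for term in counts)

# IDF: log(N / ni), with N the total number of documents.
N = len(docs)
idf = {term: math.log(N / n_i) for term, n_i in df.items()}

print(tf["d1"])
print(dict(df))
print({term: round(value, 3) for term, value in idf.items()})
```

"user" appears in three of the four documents, so its IDF is the lowest of the three terms: common terms discriminate poorly between documents.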

27
Salton's Vector Space
  • A document is represented as a vector
  • (W1, W2, ..., Wn)
  • Binary
  • Wi = 1 if the corresponding term is in the
    document
  • Wi = 0 if the term is not in the document
  • TF (Term Frequency)
  • Wi = tfi, where tfi is the number of times the
    term occurred in the document
  • TF-IDF (Term Frequency, Inverse Document Frequency)
  • Wi = tfi * idfi = tfi * (1 + log(N/dfi)), where dfi is the
    number of documents that contain term i, and N is
    the total number of documents in the collection.
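
The three weighting schemes can be compared side by side. This sketch reads the TF-IDF weight on this slide as tf * (1 + log(N/df)) with a natural logarithm; the vocabulary, document-frequency table, and sample document are hypothetical.

```python
import math
from collections import Counter

def weight_vectors(doc_terms, vocabulary, df, n_docs):
    """Return the binary, TF, and TF-IDF vectors for one document."""
    counts = Counter(doc_terms)
    binary = [1 if counts[t] > 0 else 0 for t in vocabulary]
    tf = [counts[t] for t in vocabulary]
    # Assumed reading of the slide: Wi = tfi * (1 + log(N / dfi)).
    tfidf = [counts[t] * (1 + math.log(n_docs / df[t])) for t in vocabulary]
    return binary, tf, tfidf

vocabulary = ["hardware", "software", "user"]
df = {"hardware": 2, "software": 2, "user": 4}  # hypothetical document frequencies
n_docs = 4                                      # hypothetical collection size

binary, tf, tfidf = weight_vectors(
    ["hardware", "hardware", "user"], vocabulary, df, n_docs
)
print(binary)
print(tf)
print([round(w, 3) for w in tfidf])
```

With the added 1, a term that occurs in every document ("user" here) keeps a weight equal to its raw term frequency instead of dropping to zero.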

28
  • In vector space, documents and queries are
    treated the same.
  • It is easier to do similarity search
  • find documents like this one
  • It is easier to do document clusters
  • group documents into categories and
    subcategories
  • It's easier to display search results graphically
  • Giving meaning to place or location in the
    multi-dimensional space

29
Web Indexing
  • Most web indexing is vector-based indexing, with
    variations
  • Robot indexing software keeps traversing the web to
    collect more pages and terms
  • Servers build a huge inverted-index and
    vector-index database
  • Search engines conduct different types of vector
    query matching
  • Only a few search engines implement true Boolean
    query matching

30
  • The real differences among different search
    engines are
  • their index weighting schemes
  • their query processing methods
  • their ranking algorithms
  • None of these are published by any of the search
    engine firms.

31
Alternative IR Models
  • Probabilistic Model
  • Given a document d, how likely would the user
    consider it relevant?
  • How likely would the user consider it not
    relevant?
  • If these two are known, the similarity of document d
    and query q can be defined as
  • S(d, q) = (probability that d is relevant to q) /
    (probability that d is not relevant to q)

32
Examples
  • If a document is 80% likely to be relevant to
    query q, what is its (probabilistic) similarity?
  • If a document is only 30% likely to be relevant,
    what is the similarity?

33
  • If there are 100 documents and 10 are relevant to a
    query,
  • what is the probability of relevance for a
    randomly selected document?
  • What is the similarity of this document to the
    query?
  • Any retrieval system must do much better than
    that.
  • In general, retrieval systems should retrieve
    those documents with S > 1
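
The questions on the last two slides can be worked through numerically. The sketch assumes the similarity is the ratio of the probability of relevance to the probability of non-relevance, with the latter taken as one minus the former; the function name is illustrative.

```python
def relevance_odds(p_relevant):
    """S(d, q) = P(relevant) / P(not relevant), for 0 <= p_relevant < 1."""
    if not 0.0 <= p_relevant < 1.0:
        raise ValueError("p_relevant must be in [0, 1)")
    return p_relevant / (1.0 - p_relevant)

cases = [
    ("80% likely to be relevant", 0.8),
    ("30% likely to be relevant", 0.3),
    ("random pick, 10 relevant out of 100", 10 / 100),
]

for label, p in cases:
    s = relevance_odds(p)
    print(f"{label}: S = {s:.2f}, retrieve (S > 1): {s > 1}")
```

S exceeds 1 exactly when a document is more likely relevant than not, that is, when the probability of relevance is above 0.5. A randomly chosen document in the 10-out-of-100 case scores about 0.11, far below that threshold.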

34
  • Advantages of the Probabilistic model
  • Documents can be ranked by their relevance
    probability.
  • Relevance probability can be improved through the
    interaction process.
  • Good mathematical model
  • Disadvantages
  • Involves many assumptions
  • Not very practical

35
Fuzzy Set Model
  • Fuzzy Set Theory
  • Extension of Boolean set theory
  • Instead of a binary membership definition, fuzzy
    set membership is defined continuously between 0
    and 1.
  • Example
  • Male students in our class
  • Tall students in our class
  • One is a Boolean set and the other is a fuzzy set.
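
To make the contrast concrete, here is a minimal sketch of the two sets. Everything in it is an assumption for illustration: the linear ramp for "tall" between 160 cm and 190 cm, the sample students, and the use of min, max, and 1 - x as the fuzzy AND, OR, and NOT (the standard Zadeh operators, which the slides do not mention).

```python
def tall_membership(height_cm, low=160.0, high=190.0):
    """Degree of membership in 'tall': 0 at or below `low`, 1 at or above `high`."""
    if height_cm <= low:
        return 0.0
    if height_cm >= high:
        return 1.0
    return (height_cm - low) / (high - low)

def male_membership(sex):
    """Boolean (crisp) membership: exactly 0 or 1."""
    return 1.0 if sex == "M" else 0.0

def fuzzy_and(a, b):
    return min(a, b)

def fuzzy_or(a, b):
    return max(a, b)

def fuzzy_not(a):
    return 1.0 - a

# Hypothetical students: name -> (sex, height in cm).
students = {"Ana": ("F", 175.0), "Ben": ("M", 160.0), "Cal": ("M", 190.0), "Dan": ("M", 175.0)}

for name, (sex, height) in students.items():
    male = male_membership(sex)
    tall = tall_membership(height)
    print(name, male, tall, fuzzy_and(male, tall))
```

Dan belongs to "tall male students" to degree 0.5, a value a Boolean set cannot express.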

36
  • The set of retrieved documents should be
    considered as a fuzzy set.
  • Documents are not just relevant or not-relevant.
  • Documents can be somewhat relevant.
  • Documents can be 80% likely to be relevant.
  • Good mathematical model, but not widely
    implemented or tested.

37
Latent Semantic Indexing Model
  • Map documents from a high-dimensional space to a
    lower dimensional space, while maintaining
    document relationships.
  • For clustering
  • For visualization
  • It's a popular advanced retrieval technique.
  • It's computationally expensive.

38
Neural Network Model
  • Organize the document collection as a semantic
    network through learning
  • Use known queries/relevant documents to train
    the network, and later allow the network to
    predict relevance for new queries (supervised
    learning).
  • Use document-document relationships to
    self-organize the network and move relevant
    documents close to each other (unsupervised
    learning).

39
The Fusion Model
  • Retrieve documents based on text indexing
    (Boolean model or Vector Space Model, etc.)
  • Retrieve documents based on link models
    (citations, Google's PageRank, etc.)
  • Retrieve documents based on classification models
    (classification schemes, thesauri, Yahoo
    categories, etc.).
  • Fuse the results together before responding to the
    user

40
Models for Browsing
  • Flat Model
  • No particular organization of materials
  • Hierarchical model
  • Assign documents into a hierarchical structure.
  • Hypertext Model
  • Define appropriate links among related documents.