1
Semantically Motivated Information Retrieval
  • B.Tech Project First Stage
  • Rohitashwa Bhotica (04005010)
  • Under the guidance of Prof. Pushpak Bhattacharyya
2
OUTLINE
  • Introduction
  • Information Retrieval
  • Universal Networking Language (UNL)
  • Vector Construction using UNL
  • Latent Semantic Indexing (LSI)
  • Term Document Matrix
  • Singular Value Decomposition
  • Advantages and Disadvantages
  • Conclusion and Future Work

3
INTRODUCTION
  • Retrieving information by matching the terms of a
    document is not very effective
  • Similar sentences may share few or no terms
  • The dependencies between words should also be
    taken into account
  • Retrieval should be based on conceptual meaning
  • We study how UNL and LSI can help retrieve
    information using the semantic relations between
    words

4
Information Retrieval
  • Encode each document as a vector
  • A document is then an element of a vector space
  • The fact that words are semantically related is
    not considered in this representation
  • Term Frequency (TF)
  • Each component of the vector is the frequency of
    a word in that particular document

5
Information Retrieval (contd.)
  • Inverse Document Frequency (IDF)
  • The IDF of a term t is given as -
  • idf(t) = log(N / N_t)
  • where N is the number of documents and N_t is the
    number of documents containing term t
  • TF-IDF method
  • Both the TF and IDF scores are used to calculate
    a term's weight
  • Redundant stop words (e.g. and, for) are not
    considered while constructing the vector
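
A minimal sketch of this weighting in Python; the
function and variable names are illustrative, not
part of the original slides.

    import math
    from collections import Counter

    def tfidf_vector(doc_tokens, corpus, vocabulary):
        # corpus is a list of token lists, one per document;
        # the vector has one component per vocabulary term.
        N = len(corpus)
        tf = Counter(doc_tokens)
        vector = []
        for t in vocabulary:
            # N_t = number of documents containing term t
            n_t = sum(1 for doc in corpus if t in doc)
            idf = math.log(N / n_t) if n_t else 0.0
            vector.append(tf[t] * idf)
        return vector

    # A term occurring in every document gets IDF 0:
    docs = [["michael", "eats", "apple"], ["apple", "tree", "stands"]]
    print(tfidf_vector(docs[0], docs, ["michael", "apple", "tree"]))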

6
Example
  • Consider the sentences -
  • S1 Michael eats an apple standing beside the
    tree
  • S2 The apple tree stands besides Michaels
    house
  • The document vectors using TF are -
  • VS1 lt1,1,1,1,1,1,0gt and VS2 lt1,0,1,1,1,1,1gt
  • Problem - Sentences have same words but mean
    different things
  • Not considering semantic relations of the words
  • We now see a better representation using UNL

7
Universal Networking Language
  • An artificial language that replicates the
    functions of natural languages in communication
  • Enables processing of information and knowledge
    across language barriers
  • Provides a linguistic interface to computers
  • A UNL expression is a hyper-graph representing a
    sentence in the form of a semantic network
  • Expressions are well-defined and unambiguous

8
UNL (contd.)
  • Composed of Universal Words (UWs), which are
    linked by relations to form sentences
  • UWs are the nodes of the graph and constitute the
    vocabulary of UNL
  • Relations link two UWs to each other and define
    the objective information of a sentence
  • Attributes express the speaker's point of view
    and describe the subjective information

9
Example
UNL graph for "The bachelor books a room for 2
people." (graph figure not reproduced)
10
Vector Construction using UNL
  • UWs are used as the components of the document
    vector
  • The weight of a node depends on the number of
    links incident on it
  • Relations are divided into four categories (see
    the sketch after the construction steps on the
    next slide) -
  • Transferable - the weight of the parent is added
    to the child node, e.g. obj, agt
  • Equal Weight - the weight of the child is set
    equal to that of the parent, e.g. coo, or
  • Partial Transferable - the link is given a weight
    of 2, e.g. plt, to, via
  • Non Transferable - no weight is transferred from
    the parent, e.g. qua
  • The semantic information present in the sentence
    is thus used to construct the document vector

11
Method of Construction
  • TF-IDF can also be used. The document vector is
    constructed from UNL by these steps (sketched in
    code below) -
  • Construct a UNL graph from the document
  • Count the links to and from each UW in the graph
  • Adjust the count depending on the connecting
    relation
  • Add up all the counts for each UW in the document
  • Multiply this by the IDF
  • Construct the document vector
  • This will perform better in some cases
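
A rough sketch of steps 2-4 above, assuming a simple
edge-list encoding of the UNL graph. The category
table and the propagation order here are
illustrative; the precise rules are those of Shah
et al. (2002).

    TRANSFERABLE = {"obj", "agt"}    # parent weight added to the child
    EQUAL_WEIGHT = {"coo", "or"}     # child weight set equal to the parent's
    PARTIAL = {"plt", "to", "via"}   # link given a weight of 2
    # any other relation (e.g. "qua") transfers no weight

    def uw_weights(edges):
        # edges: (relation, parent_uw, child_uw) triples.
        # Start every UW at its link count, then adjust
        # the counts according to the relation category.
        weights = {}
        for rel, parent, child in edges:
            for uw in (parent, child):
                weights[uw] = weights.get(uw, 0) + 1
        for rel, parent, child in edges:
            if rel in TRANSFERABLE:
                weights[child] += weights[parent]
            elif rel in EQUAL_WEIGHT:
                weights[child] = weights[parent]
            elif rel in PARTIAL:
                weights[child] += 2
        return weights

    # e.g. agt(eat, John), obj(eat, apple)
    print(uw_weights([("agt", "eat", "John"), ("obj", "eat", "apple")]))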

12
Example
  • Graph of "John is going to school eating an
    apple" (graph figure not reproduced)
  • The construction described above yields the
    vector V = <4, 3, 2, 3, 4>

13
Latent Semantic Indexing
  • A variant of the vector retrieval method
  • Dependencies between words are taken into account
  • Information is thus retrieved on the basis of the
    conceptual topic and meaning of a document
  • A query and a document can have high similarity
    even if they share no words but mean the same
    thing
  • e.g. "user" and "interface" have high
    co-occurrence with "HCI" and "interaction" even
    though no terms are common
  • Overcomes problems in lexical matching such as
    synonymy and polysemy

14
Latent Semantic Indexing (contd.)
  • Uses statistically derived conceptual indices
  • Documents having many words in common are
    considered to be semantically close
  • A purely mathematical approach - it does not
    understand the meaning of a document
  • Makes no use of word order, syntactic relations,
    logic, or morphology
  • Considers only content words while indexing

15
Search for Content Words
  • One method of generating the content words is -
  • List all the words that appear in the documents
  • Discard articles, prepositions and conjunctions
  • Discard common words such as know, see, do
  • Discard the pronouns
  • Discard frilly words such as therefore, thus,
    etc.
  • Remove words that appear in all documents
  • Remove words that occur in only one document
  • We thus get the Term Document Matrix

16
Example
  • Treasury Secretary Paul O'Neill expressed
    irritation on Wednesday that European countries
    have refused to go along with a U.S. proposal to
    boost the amount of direct grants rich nations
    offer poor countries
  • Step 1: Remove the formatting, capitalization
    etc. -
  • treasury secretary paul o'neill expressed
    irritation wednesday that european countries have
    refused to go along with a us proposal to boost
    the amount of direct grants rich nations offer
    poor countries
  • Step 2: Remove the common English words. We now
    get -
  • treasury secretary paul o'neill expressed
    irritation european countries refused US proposal
    boost direct grants rich nations poor countries
  • Step 3: Remove the plurals etc. by stemming the
    documents. We are now left with -
  • countri (2) direct europ express grant irritat
    nation o'neill paul poor propos refus rich
    secretar treasuri US
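
A minimal sketch of these three steps in Python,
using NLTK's Porter stemmer. The stop list here is a
small illustrative subset, and the exact stems (e.g.
europ vs. european) depend on which stemmer is used.

    import re
    from collections import Counter
    from nltk.stem import PorterStemmer  # pip install nltk

    # Tiny illustrative stop list; a real system uses a much fuller one.
    # "u" and "s" crudely absorb the tokens left over from "U.S.".
    STOP_WORDS = {"on", "that", "have", "to", "go", "along", "with", "a",
                  "the", "amount", "of", "offer", "wednesday", "u", "s"}

    def content_words(text):
        # Step 1: strip formatting, punctuation and capitalization
        tokens = re.findall(r"[a-z']+", text.lower())
        # Step 2: drop the common English words
        tokens = [t for t in tokens if t not in STOP_WORDS]
        # Step 3: stem what remains and count occurrences
        stemmer = PorterStemmer()
        return Counter(stemmer.stem(t) for t in tokens)

    print(content_words("Treasury Secretary Paul O'Neill expressed "
                        "irritation on Wednesday that European countries "
                        "have refused to go along with a U.S. proposal to "
                        "boost the amount of direct grants rich nations "
                        "offer poor countries"))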

17
Term Document Matrix
  • Each row stands for a word and each column for a
    text passage (document)
  • Each cell contains the frequency (or TF-IDF
    weight) of the word in the particular document
  • All document columns are length-normalized
  • The matrix is then decomposed by SVD so that
    queries can be answered
  • We try to reduce false alarms and misses, the two
    types of retrieval failure in which we either
    retrieve irrelevant documents or miss relevant
    ones
  • LSI reduces the dimensionality of an IR problem
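
A small illustrative term-document matrix in NumPy,
with the column normalization mentioned above (the
counts are made up):

    import numpy as np

    # Toy term-document matrix: rows = terms, columns = documents.
    A = np.array([[2., 0., 1.],
                  [0., 1., 1.],
                  [1., 1., 0.]])

    # Length-normalize each document column so that long
    # documents do not dominate similarity scores.
    A = A / np.linalg.norm(A, axis=0)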

18
Singular Value Decomposition
  • Breaks a matrix into a set of smaller components
  • The dimensionality of the matrix is reduced
  • The noise in the term document matrix is
    discarded
  • Similar things become more similar and vice versa
  • An SVD of an M x N matrix A is a factorization of
    the form
  • A = U Σ V^T
  • U is an M x M orthogonal matrix, V is an N x N
    orthogonal matrix, and Σ is an M x N diagonal
    matrix containing the singular values of A in
    descending order
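
For instance, with NumPy (toy matrix; np.linalg.svd
returns the singular values already in descending
order):

    import numpy as np

    A = np.array([[2., 0., 1.],
                  [0., 1., 1.],
                  [1., 1., 0.]])

    # Full SVD: A = U @ diag(s) @ Vt
    U, s, Vt = np.linalg.svd(A)

    # Rank-k approximation: keep the k largest singular values.
    k = 2
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]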

19
Example
  • We can see that the columns of U and V are unit
    length and mutually orthogonal
  • We can restrict U and V to their first k columns
    and Σ to its first k singular values to get Ã
  • Ã is the best least-squares approximation of A by
    a matrix of rank k
  • Co-occurring terms are matched onto the same
    dimensions
  • Similarity in the representation is thus
    increased

20
Example
  • Matrix B(2 x d) = Σ(2 x 2) D^T(2 x d) of the
    documents, with k = 2
  • Matrix of document correlations: E^T E, where E
    is B with length-normalized columns
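
Continuing the NumPy sketch, the reduced document
matrix and its correlations could be computed as:

    import numpy as np

    A = np.array([[2., 0., 1.],
                  [0., 1., 1.],
                  [1., 1., 0.]])
    U, s, Vt = np.linalg.svd(A)
    k = 2

    B = np.diag(s[:k]) @ Vt[:k, :]     # B = Σ_k D^T, a 2 x d document matrix
    E = B / np.linalg.norm(B, axis=0)  # length-normalize each column
    corr = E.T @ E                     # pairwise document correlations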

21
User Queries and Updating
  • A query must be represented as a k-dimensional
    vector
  • It is then compared to all document vectors,
    which are ranked by their similarity to the query
  • We would not want to recompute the SVD on every
    update
  • Instead, use the existing SVD to represent new
    information
  • Compute the SVD once; later documents are folded
    in
  • The equations for folding in a document are -
  • A = T S D^T
  • T^T A = T^T T S D^T
  • T^T A = S D^T (since T^T T = I)
  • T is the term matrix and A is the vector to be
    folded in
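
A minimal sketch of folding in, following the last
equation above (so the folded vector is
S^-1 T^T a); the names are illustrative.

    import numpy as np

    def fold_in(a, T_k, s_k):
        # Project a new document (or query) term vector `a`
        # into the existing k-dimensional LSI space without
        # recomputing the SVD: from T^T A = S D^T we get
        # d_hat = S^{-1} T^T a.
        return (T_k.T @ a) / s_k

    def cosine(u, v):
        # Rank documents against a folded-in query by
        # cosine similarity.
        return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))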

22
Advantages and Disadvantages
  • Advantages -
  • Better representation of documents and queries
  • Problems due to polysemy and synonymy are avoided
    since the semantic relations are taken into
    account
  • Disadvantages -
  • The SVD representation takes up more space than a
    sparse vector representation
  • SVD computation has high pre-processing costs

23
Conclusion and Future Work
  • The features of UNL and LSI can improve
    information retrieval
  • SVD helps in calculating the similarity between
    documents and terms
  • Future work: implement this approach
  • Classify the documents of a Wikipedia dump using
    the features of UNL and LSI

24
BIBLIOGRAPHY
  • Universal Networking Language (UNL)
    Specifications, Version 2005
  • Shah, Choudhary, Bhattacharyya. Constructing
    Better Document Vectors Using Universal
    Networking Language (UNL). Mumbai, India, 2002
  • Mukerjee, Raina, Kapil, Goyal, Shukla. Universal
    Networking Language: A Tool for
    Language-Independent Semantics. Kanpur, India
  • Landauer, Foltz, Laham. An Introduction to Latent
    Semantic Analysis
  • Barbara Rosario. Latent Semantic Indexing: An
    Overview. 2000
  • Deerwester, Dumais, Furnas, Landauer, Harshman.
    Indexing by Latent Semantic Analysis. 1990
  • http://www.undl.org
  • http://www.hirank.com/semantic-indexing-project/lsi/lsa_definition.htm
  • Manning, Schuetze. Foundations of Statistical
    Natural Language Processing. 1999
  • E. Garcia. Singular Value Decomposition (SVD): A
    Fast Track Tutorial. 2006