1
Semantically Motivated Information Retrieval
  • B.Tech Project First Stage
  • Rohitashwa Bhotica (04005010)
  • Under the guidance of Prof. Pushpak Bhattacharyya
2
OUTLINE
  • Introduction
  • Information Retrieval
  • Universal Networking Language (UNL)
  • Vector Construction using UNL
  • Latent Semantic Indexing (LSI)
  • Term Document Matrix
  • Singular Value Decomposition
  • Advantages and Disadvantages
  • Conclusion and Future Work

3
INTRODUCTION
  • Retrieving information by matching the terms of a
    document is not very effective
  • Similar sentences may share few or no terms
  • The dependencies between words should also be
    taken into account
  • Retrieval should be based on conceptual meaning
  • We study how UNL and LSI can help retrieve
    information using the semantic relations between
    words

4
Information Retrieval
  • Encode each document as a vector
  • A document is then an element of a vector space
  • The fact that words are semantically related is
    not considered in this representation
  • Term Frequency (TF)
  • Each component of the vector is the frequency of
    a word in that particular document

5
Information Retrieval (contd.)
  • Inverse Document Frequency (IDF)
  • The IDF of a term t is given as -
  • idf(t) = log(N / N_t)
  • where N is the number of documents and N_t is the
    number of documents containing term t
  • TF-IDF method
  • Both the TF and IDF scores are used to calculate
    a term's weight
  • Redundant stop words (e.g. and, for) are not
    considered while constructing the vector
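
A minimal sketch of this weighting in Python; the
function and variable names are illustrative, not
part of the original slides.

    import math
    from collections import Counter

    def tfidf_vector(doc_tokens, corpus, vocabulary):
        # corpus is a list of token lists, one per document;
        # the vector has one component per vocabulary term.
        N = len(corpus)
        tf = Counter(doc_tokens)
        vector = []
        for t in vocabulary:
            # N_t = number of documents containing term t
            n_t = sum(1 for doc in corpus if t in doc)
            idf = math.log(N / n_t) if n_t else 0.0
            vector.append(tf[t] * idf)
        return vector

    # A term occurring in every document gets IDF 0:
    docs = [["michael", "eats", "apple"], ["apple", "tree", "stands"]]
    print(tfidf_vector(docs[0], docs, ["michael", "apple", "tree"]))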

6
Example
  • Consider the sentences -
  • S1 Michael eats an apple standing beside the
    tree
  • S2 The apple tree stands besides Michaels
    house
  • The document vectors using TF are -
  • VS1 lt1,1,1,1,1,1,0gt and VS2 lt1,0,1,1,1,1,1gt
  • Problem - Sentences have same words but mean
    different things
  • Not considering semantic relations of the words
  • We now see a better representation using UNL

7
Universal Networking Language
  • An artificial language that replicates the
    functions of natural languages in communication
  • Enables processing of information and knowledge
    across language barriers
  • Provides a linguistic interface to computers
  • A UNL expression is a hyper-graph representing a
    sentence in the form of a semantic network
  • Expressions are well-defined and unambiguous

8
UNL (contd.)
  • Composed of Universal Words (UWs), which are
    linked by relations to form sentences
  • UWs are the nodes of the graph and constitute the
    vocabulary of UNL
  • Relations link two UWs to each other and define
    the objective information of a sentence
  • Attributes express the speaker's point of view
    and describe the subjective information

9
Example
UNL graph for "The bachelor books a room for 2
people." (graph figure not reproduced)
10
Vector Construction using UNL
  • UWs are used as the components of the document
    vector
  • The weight of a node depends on the number of
    links incident on it
  • Relations are divided into four categories (see
    the sketch after the construction steps on the
    next slide) -
  • Transferable - the weight of the parent is added
    to the child node, e.g. obj, agt
  • Equal Weight - the weight of the child is set
    equal to that of the parent, e.g. coo, or
  • Partial Transferable - the link is given a weight
    of 2, e.g. plt, to, via
  • Non Transferable - no weight is transferred from
    the parent, e.g. qua
  • The semantic information present in the sentence
    is thus used to construct the document vector

11
Method of Construction
  • TF-IDF can also be used. The document vector is
    constructed from UNL by these steps (sketched in
    code below) -
  • Construct a UNL graph from the document
  • Count the links to and from each UW in the graph
  • Adjust the count depending on the connecting
    relation
  • Add up all the counts for each UW in the document
  • Multiply this by the IDF
  • Construct the document vector
  • This will perform better in some cases
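
A rough sketch of steps 2-4 above, assuming a simple
edge-list encoding of the UNL graph. The category
table and the propagation order here are
illustrative; the precise rules are those of Shah
et al. (2002).

    TRANSFERABLE = {"obj", "agt"}    # parent weight added to the child
    EQUAL_WEIGHT = {"coo", "or"}     # child weight set equal to the parent's
    PARTIAL = {"plt", "to", "via"}   # link given a weight of 2
    # any other relation (e.g. "qua") transfers no weight

    def uw_weights(edges):
        # edges: (relation, parent_uw, child_uw) triples.
        # Start every UW at its link count, then adjust
        # the counts according to the relation category.
        weights = {}
        for rel, parent, child in edges:
            for uw in (parent, child):
                weights[uw] = weights.get(uw, 0) + 1
        for rel, parent, child in edges:
            if rel in TRANSFERABLE:
                weights[child] += weights[parent]
            elif rel in EQUAL_WEIGHT:
                weights[child] = weights[parent]
            elif rel in PARTIAL:
                weights[child] += 2
        return weights

    # e.g. agt(eat, John), obj(eat, apple)
    print(uw_weights([("agt", "eat", "John"), ("obj", "eat", "apple")]))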

12
Example
  • Graph of "John is going to school eating an
    apple" (graph figure not reproduced)
  • The construction described above yields the
    vector V = <4, 3, 2, 3, 4>

13
Latent Semantic Indexing
  • A variant of the vector retrieval method
  • Dependencies between words are taken into account
  • Information is thus retrieved on the basis of the
    conceptual topic and meaning of a document
  • A query and a document can have high similarity
    even if they share no words but mean the same
    thing
  • e.g. "user" and "interface" have high
    co-occurrence with "HCI" and "interaction" even
    though no terms are common
  • Overcomes problems in lexical matching such as
    synonymy and polysemy

14
Latent Semantic Indexing (contd.)
  • Uses statistically derived conceptual indices
  • Documents having many words in common are
    considered to be semantically close
  • A purely mathematical approach - it does not
    understand the meaning of a document
  • Makes no use of word order, syntactic relations,
    logic, or morphology
  • Considers only content words while indexing

15
Search for Content Words
  • One method of generating the content words is -
  • List all the words that appear in the documents
  • Discard articles, prepositions and conjunctions
  • Discard common words such as know, see, do
  • Discard the pronouns
  • Discard frilly words such as therefore, thus,
    etc.
  • Remove words that appear in all documents
  • Remove words that occur in only one document
  • We thus get the Term Document Matrix

16
Example
  • Treasury Secretary Paul O'Neill expressed
    irritation on Wednesday that European countries
    have refused to go along with a U.S. proposal to
    boost the amount of direct grants rich nations
    offer poor countries
  • Step 1: Remove the formatting, capitalization
    etc. -
  • treasury secretary paul o'neill expressed
    irritation wednesday that european countries have
    refused to go along with a us proposal to boost
    the amount of direct grants rich nations offer
    poor countries
  • Step 2: Remove the common English words. We now
    get -
  • treasury secretary paul o'neill expressed
    irritation european countries refused US proposal
    boost direct grants rich nations poor countries
  • Step 3: Remove the plurals etc. by stemming the
    documents. We are now left with -
  • countri (2) direct europ express grant irritat
    nation o'neill paul poor propos refus rich
    secretar treasuri US
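
A minimal sketch of these three steps in Python,
using NLTK's Porter stemmer. The stop list here is a
small illustrative subset, and the exact stems (e.g.
europ vs. european) depend on which stemmer is used.

    import re
    from collections import Counter
    from nltk.stem import PorterStemmer  # pip install nltk

    # Tiny illustrative stop list; a real system uses a much fuller one.
    # "u" and "s" crudely absorb the tokens left over from "U.S.".
    STOP_WORDS = {"on", "that", "have", "to", "go", "along", "with", "a",
                  "the", "amount", "of", "offer", "wednesday", "u", "s"}

    def content_words(text):
        # Step 1: strip formatting, punctuation and capitalization
        tokens = re.findall(r"[a-z']+", text.lower())
        # Step 2: drop the common English words
        tokens = [t for t in tokens if t not in STOP_WORDS]
        # Step 3: stem what remains and count occurrences
        stemmer = PorterStemmer()
        return Counter(stemmer.stem(t) for t in tokens)

    print(content_words("Treasury Secretary Paul O'Neill expressed "
                        "irritation on Wednesday that European countries "
                        "have refused to go along with a U.S. proposal to "
                        "boost the amount of direct grants rich nations "
                        "offer poor countries"))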

17
Term Document Matrix
  • Each row stands for a word and each column for a
    text passage (document)
  • Each cell contains the frequency (or TF-IDF
    weight) of the word in the particular document
  • All document columns are length-normalized
  • The matrix is then decomposed by SVD so that
    queries can be answered
  • We try to reduce false alarms and misses, the two
    types of retrieval failure in which we either
    retrieve irrelevant documents or miss relevant
    ones
  • LSI reduces the dimensionality of an IR problem
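
A small illustrative term-document matrix in NumPy,
with the column normalization mentioned above (the
counts are made up):

    import numpy as np

    # Toy term-document matrix: rows = terms, columns = documents.
    A = np.array([[2., 0., 1.],
                  [0., 1., 1.],
                  [1., 1., 0.]])

    # Length-normalize each document column so that long
    # documents do not dominate similarity scores.
    A = A / np.linalg.norm(A, axis=0)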

18
Singular Value Decomposition
  • Breaks a matrix into a set of smaller components
  • The dimensionality of the matrix is reduced
  • The noise in the term document matrix is
    discarded
  • Similar things become more similar and vice versa
  • An SVD of an M x N matrix A is a factorization of
    the form
  • A = U Σ V^T
  • U is an M x M orthogonal matrix, V is an N x N
    orthogonal matrix, and Σ is an M x N diagonal
    matrix containing the singular values of A in
    descending order
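
For instance, with NumPy (toy matrix; np.linalg.svd
returns the singular values already in descending
order):

    import numpy as np

    A = np.array([[2., 0., 1.],
                  [0., 1., 1.],
                  [1., 1., 0.]])

    # Full SVD: A = U @ diag(s) @ Vt
    U, s, Vt = np.linalg.svd(A)

    # Rank-k approximation: keep the k largest singular values.
    k = 2
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]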

19
Example
  • We can see that the columns of U and V are unit
    length and mutually orthogonal
  • We can restrict U and V to their first k columns
    and Σ to its first k singular values to get Ã
  • Ã is the best least-squares approximation of A by
    a matrix of rank k
  • Co-occurring terms are matched onto the same
    dimensions
  • Similarity in the representation is thus
    increased

20
Example
  • Matrix B(2 x d) = Σ(2 x 2) D^T(2 x d) of the
    documents, with k = 2
  • Matrix of document correlations: E^T E, where E
    is B with length-normalized columns
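
Continuing the NumPy sketch, the reduced document
matrix and its correlations could be computed as:

    import numpy as np

    A = np.array([[2., 0., 1.],
                  [0., 1., 1.],
                  [1., 1., 0.]])
    U, s, Vt = np.linalg.svd(A)
    k = 2

    B = np.diag(s[:k]) @ Vt[:k, :]     # B = Σ_k D^T, a 2 x d document matrix
    E = B / np.linalg.norm(B, axis=0)  # length-normalize each column
    corr = E.T @ E                     # pairwise document correlations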

21
User Queries and Updating
  • A query must be represented as a k-dimensional
    vector
  • It is then compared to all document vectors,
    which are ranked by their similarity to the query
  • We would not want to recompute the SVD on every
    update
  • Instead, use the existing SVD to represent new
    information
  • Compute the SVD once; later documents are folded
    in
  • The equations for folding in a document are -
  • A = T S D^T
  • T^T A = T^T T S D^T
  • T^T A = S D^T (since T^T T = I)
  • T is the term matrix and A is the vector to be
    folded in
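
A minimal sketch of folding in, following the last
equation above (so the folded vector is
S^-1 T^T a); the names are illustrative.

    import numpy as np

    def fold_in(a, T_k, s_k):
        # Project a new document (or query) term vector `a`
        # into the existing k-dimensional LSI space without
        # recomputing the SVD: from T^T A = S D^T we get
        # d_hat = S^{-1} T^T a.
        return (T_k.T @ a) / s_k

    def cosine(u, v):
        # Rank documents against a folded-in query by
        # cosine similarity.
        return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))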

22
Advantages and Disadvantages
  • Advantages -
  • Better representation of documents and queries
  • Problems due to polysemy and synonymy are avoided
    since the semantic relations are taken into
    account
  • Disadvantages -
  • The SVD representation takes up more space than a
    sparse vector representation
  • SVD computation has high pre-processing costs

23
Conclusion and Future Work
  • The features of UNL and LSI can improve
    information retrieval
  • SVD helps in calculating the similarity between
    documents and terms
  • Future work: implement this approach
  • Classify the documents of a Wikipedia dump using
    the features of UNL and LSI

24
BIBLIOGRAPHY
  • Universal Networking Language (UNL)
    Specifications, Version 2005
  • Shah, Choudhary, Bhattacharyya. Constructing
    Better Document Vectors Using Universal
    Networking Language (UNL). Mumbai, India, 2002
  • Mukerjee, Raina, Kapil, Goyal, Shukla. Universal
    Networking Language: A Tool for
    Language-Independent Semantics. Kanpur, India
  • Landauer, Foltz, Laham. An Introduction to Latent
    Semantic Analysis
  • Barbara Rosario. Latent Semantic Indexing: An
    Overview. 2000
  • Deerwester, Dumais, Furnas, Landauer, Harshman.
    Indexing by Latent Semantic Analysis. 1990
  • http://www.undl.org
  • http://www.hirank.com/semantic-indexing-project/lsi/lsa_definition.htm
  • Manning, Schuetze. Foundations of Statistical
    Natural Language Processing. 1999
  • E. Garcia. Singular Value Decomposition (SVD): A
    Fast Track Tutorial. 2006