Information Retrieval and Web Search - PowerPoint PPT Presentation

Description:

Controlled Vocabulary Free Text. Cross-Language Text Retrieval ... Easily generated from a vector of term weights. Multiply by the term-document matrix ...

Slides: 30
Provided by: litCs

1
Information Retrieval and Web Search
  • Cross Language Information Retrieval
  • Instructor: Rada Mihalcea
  • Class web page: http://lit.csci.unt.edu/classes/CSCE5200
  • Some of the slides are from a course taught by
    Doug Oard at U. Maryland

2
The General Problem
  • Find documents written in any language
  • Using queries expressed in a single language

3
Why Do Cross-Language IR?
  • When users can read several languages
  • Eliminates multiple queries
  • Query in most fluent language
  • Monolingual users can also benefit
  • If translations can be provided
  • If it suffices to know that a document exists
  • If text captions are used to search for images

4
Source: Michael Lesk, How Much Information Is There in the World?
5
Supply Side: Internet Hosts
  • Guess: What will be the most widely used language on the Web in 2010?
  • Source: Network Wizards, Jan 99 Internet Domain Survey
6
Demand Side: Number of Speakers
  • Source: http://www.g11n.com/faq.html
7
Search Technology
  • Diagram labels: Chinese Query, Language Identification, Chinese Feature Assignment, English Feature Assignment
  • Monolingual Chinese Matching, ranked output: 1 (0.72), 2 (0.48)
  • Cross-Language Matching, ranked output: 3 (0.91), 4 (0.57), 5 (0.36)
8
Language Identification
  • Can be specified using metadata
  • Included in HTTP and HTML
  • Can be determined using word-scale features
  • Which dictionary gets the most hits?
  • Can be determined using subword features
  • Letter n-grams, for example
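
A minimal sketch of the subword approach, assuming letter trigrams and a simple overlap score. The training sentences, the `PROFILES` table, and all function names are hypothetical; a usable identifier would need far larger profiles.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Count overlapping character n-grams, with one space of padding per side."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def overlap_score(profile, sample):
    """Fraction of the sample's n-gram occurrences that also occur in the profile."""
    total = sum(sample.values())
    if total == 0:
        return 0.0
    hits = sum(count for gram, count in sample.items() if gram in profile)
    return hits / total

def identify_language(text, profiles, n=3):
    """Return the language whose profile covers the most n-grams of the text."""
    sample = char_ngrams(text, n)
    return max(profiles, key=lambda lang: overlap_score(profiles[lang], sample))

# Toy training text (hypothetical, for illustration only).
TRAINING = {
    "english": "the quick brown fox jumps over the lazy dog and the cat sleeps in the house",
    "spanish": "el rápido zorro marrón salta sobre el perro perezoso y el gato duerme en la casa",
}
PROFILES = {lang: char_ngrams(text) for lang, text in TRAINING.items()}

print(identify_language("the dog sleeps", PROFILES))
print(identify_language("el perro duerme", PROFILES))
```

The word-scale variant on this slide ("which dictionary gets the most hits?") has the same shape: swap `char_ngrams` for a word tokenizer and use each language's word list as its profile.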

9
Design Decisions
  • What to index?
  • Free text or controlled vocabulary
  • What to translate?
  • Queries or documents
  • Where to get translation knowledge?

10
Query Vector Translation
  • Diagram labels: Chinese Query Features, Query (Vector) Translation, English Document Features
  • Monolingual English Matching, ranked output: 3 (0.91), 4 (0.57), 5 (0.36)
11
Document Vector Translation
  • Diagram labels: Chinese Query Features, English Document Features, Document (Vector) Translation
  • Monolingual Chinese Matching, ranked output: 3 (0.91), 4 (0.57), 5 (0.36)
12
Matching Interlingual Representations
  • Diagram labels: Chinese Query Features, Query Folding In, English Document Features, Document Folding In
  • Interlingual Matching, ranked output: 3 (0.91), 4 (0.57), 5 (0.36)
13
Query vs. Document Translation
  • Query translation
  • Very efficient for short queries
  • Not as big an advantage for relevance feedback
  • Hard to resolve ambiguous query terms
  • Document translation
  • May be needed by the selection interface
  • And supports adaptive filtering well
  • Slow, but only need to do it once per document
  • Poor scale-up to large numbers of languages

14
Cross-Language Text Retrieval
  • Taxonomy diagram, node labels in extraction order
  • Query Translation, Document Translation
  • Text Translation, Vector Translation
  • Controlled Vocabulary, Free Text
  • Knowledge-based, Corpus-based
  • Ontology-based, Dictionary-based
  • Term-aligned, Sentence-aligned, Document-aligned, Unaligned
  • Thesaurus-based
  • Parallel, Comparable
15
Translation Knowledge
  • A lexicon
  • e.g., extract term list from a bilingual
    dictionary
  • Corpora
  • Parallel or comparable, linked or unlinked
  • Algorithmic
  • e.g., transliteration rules, cognate matching
  • The user

16
Types of Lexicons
  • Ontology
  • Representation of concepts and relationships
  • Thesaurus
  • Ontology specialized for retrieval
  • Bilingual lexicon
  • Ontology specialized for machine translation
  • Bilingual dictionary
  • Ontology specialized for human translation

17
Multilingual Thesauri
  • Adapt the knowledge structure
  • Cultural differences influence indexing choices
  • Use language-independent descriptors
  • Matched to a unique term in each language
  • Three construction techniques
  • Build it from scratch
  • Translate an existing thesaurus
  • Merge monolingual thesauri

18
Machine Readable Dictionaries
  • Based on printed bilingual dictionaries
  • Becoming widely available
  • Used to produce bilingual term lists
  • Cross-language term mappings are accessible
  • Sometimes listed in order of most common usage
  • Some knowledge structure is also present
  • Hard to extract and represent automatically
  • The challenge is to pick the right translation

19
Unconstrained Query Translation
  • Replace each word with every translation
  • Typically 5-10 translations per word
  • About 50% of monolingual effectiveness
  • Ambiguity is a serious problem
  • Example: Fly (English)
  • 8 word senses (e.g., to fly a flag)
  • 13 Spanish translations (enarbolar, ondear, ...)
  • 38 English retranslations (hoist, brandish, lift, ...)
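
A small sketch of unconstrained translation, assuming a bilingual term list stored as a dictionary. The `EN_ES` entries are a hypothetical toy list (only "enarbolar" and "ondear" come from the slide), and `translate_unconstrained` is an illustrative name.

```python
# Hypothetical toy bilingual term list (English -> Spanish).
EN_ES = {
    "fly": ["volar", "enarbolar", "ondear", "mosca"],
    "flag": ["bandera"],
}

def translate_unconstrained(query_terms, term_list):
    """Replace each query word with every listed translation.

    Words missing from the term list are passed through unchanged.
    """
    translated = []
    for term in query_terms:
        translated.extend(term_list.get(term, [term]))
    return translated

spanish_query = translate_unconstrained(["fly", "flag"], EN_ES)
print(spanish_query)
```

Even in this toy list, the two-word query grows to five terms, and "mosca" (the insect) has nothing to do with flags. That is the ambiguity problem the slide describes.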

20
Exploiting Part-of-Speech Tags
  • Constrain translations by part of speech
  • Noun, verb, adjective, ...
  • Effective taggers are available
  • Works well when queries are full sentences
  • Short queries provide little basis for tagging
  • Constrained matching can hurt monolingual IR
  • Nouns in queries often match verbs in documents
  • This is why stemming usually improves performance
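
One way to picture the constraint, assuming the term list is keyed by (word, tag) pairs and the query arrives already tagged. The `EN_ES_POS` table and the fallback for untagged words are assumptions made for illustration, not part of the slides.

```python
# Hypothetical toy term list keyed by (word, part of speech).
EN_ES_POS = {
    ("fly", "verb"): ["volar", "ondear", "enarbolar"],
    ("fly", "noun"): ["mosca"],
}

def translate_with_pos(tagged_query, term_list):
    """Translate (word, tag) pairs, keeping only translations listed for that tag.

    When the tag is None (no reliable tag, as with very short queries),
    fall back to every translation of the word.
    """
    result = []
    for word, tag in tagged_query:
        if tag is None:
            for (entry_word, _), targets in term_list.items():
                if entry_word == word:
                    result.extend(targets)
        else:
            result.extend(term_list.get((word, tag), []))
    return result

print(translate_with_pos([("fly", "noun")], EN_ES_POS))
print(translate_with_pos([("fly", None)], EN_ES_POS))
```

The fallback branch corresponds to the short-query case on the slide: without a tag, the constraint cannot remove anything.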

21
Phrase Indexing
  • Improves retrieval effectiveness two ways
  • Phrases are less ambiguous than single words
  • Idiomatic phrases translate as a single concept
  • Three ways to identify phrases
  • Semantic (e.g., appears in a dictionary)
  • Syntactic (e.g., parse as a noun phrase)
  • Cooccurrence (words found together often)
  • Semantic phrase results are impressive

22
Types of Bilingual Corpora
  • Parallel corpora: translation-equivalent pairs
  • Document pairs
  • Sentence pairs
  • Term pairs
  • Comparable corpora
  • Content-equivalent document pairs
  • E.g. newspaper articles in different languages,
    on the same day (for the same event)
  • Unaligned corpora
  • Content from the same domain

23
Pseudo-Relevance Feedback
  • Enter query terms in French
  • Find top French documents in parallel corpus
  • Construct a query from English translations
  • Perform a monolingual free text search

Diagram labels: French Query Terms, French Text Retrieval System, Parallel Corpus, Top ranked French Documents, English Translations, Alta Vista, English Web Pages
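
The first three steps can be sketched as follows, assuming a tiny in-memory parallel corpus and plain term-count scoring. `PARALLEL`, `STOPWORDS`, and the function names are hypothetical. The final step, running the English query through a monolingual search engine, is left out.

```python
from collections import Counter

# Hypothetical toy parallel corpus: (French text, English translation) pairs.
PARALLEL = [
    ("le chat dort sur le tapis", "the cat sleeps on the rug"),
    ("le chat mange du poisson", "the cat eats fish"),
    ("la bourse monte aujourd'hui", "the stock market rises today"),
]
STOPWORDS = {"the", "on", "a", "of"}

def score(query_terms, text):
    """Count query term occurrences in a whitespace-tokenized text."""
    counts = Counter(text.split())
    return sum(counts[term] for term in query_terms)

def cross_language_query(french_terms, corpus, top_docs=2, top_terms=3):
    """Build an English query from the English sides of the top French matches."""
    ranked = sorted(corpus, key=lambda pair: score(french_terms, pair[0]), reverse=True)
    english_counts = Counter()
    for french, english in ranked[:top_docs]:
        if score(french_terms, french) > 0:  # ignore documents with no match
            english_counts.update(w for w in english.split() if w not in STOPWORDS)
    return [term for term, _ in english_counts.most_common(top_terms)]

english_query = cross_language_query(["chat"], PARALLEL)
print(english_query)
```

No bilingual dictionary is involved: the translation knowledge comes entirely from which English texts are paired with the top-ranked French ones.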
24
Learning From Document Pairs
  • Count how often each term occurs in each pair
  • Treat each pair as a single document

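Term counts per document pair (nonzero counts listed left to right; blank cells omitted)
Columns: English Terms E1 E2 E3 E4 E5, Spanish Terms S1 S2 S3 S4
Doc 1: 4, 2, 2, 1
Doc 2: 8, 4, 4, 2
Doc 3: 2, 2, 1, 2
Doc 4: 2, 1, 2, 1
Doc 5: 4, 1, 2, 1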
25
Similarity-Based Dictionaries
  • Automatically developed from aligned documents
  • Terms E1 and E3 are used in similar ways
  • Terms E1 and S1 (or E3 and S4) are even more similar
  • For each term, find the most similar terms in the other language
  • Retain only the top few (5 or so)
  • Performs as well as dictionary-based techniques
  • Evaluated on a comparable corpus of news stories
  • Stories were automatically linked based on date
    and subject
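
A minimal sketch of the idea, assuming each term is represented by its counts across the aligned document pairs and that similarity is measured with the cosine. The counts below are hypothetical and are not the table from the previous slide.

```python
import math

# Hypothetical counts, one column per aligned document pair.
ENGLISH = {"house": [4, 0, 2], "dog": [0, 3, 1]}
SPANISH = {"casa": [2, 0, 1], "perro": [0, 6, 2], "y": [1, 1, 1]}

def cosine(u, v):
    """Cosine similarity of two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity_dictionary(source, target, top_k=1):
    """For each source term, keep the top_k most similar target terms."""
    entries = {}
    for term, vec in source.items():
        ranked = sorted(target, key=lambda t: cosine(vec, target[t]), reverse=True)
        entries[term] = ranked[:top_k]
    return entries

dictionary = similarity_dictionary(ENGLISH, SPANISH)
print(dictionary)
```

A term such as "y", which appears evenly in every pair, is moderately similar to everything and close to nothing. That is one reason to retain only the top few candidates.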

26
Generalized Vector Space Model
  • Term space of each language is different
  • But the document space for a corpus is the same
  • Describe new documents based on the corpus
  • Vector of cosine similarity to each corpus
    document
  • Easily generated from a vector of term weights
  • Multiply by the term-document matrix
  • Compute cosine similarity in document space
  • Excellent results when the domain is the same
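
A sketch of the mapping into document space, assuming a training corpus of aligned document pairs stored as term-weight vectors (the columns of each language's term-document matrix). The vocabularies, weights, and names are hypothetical.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def cosine(u, v):
    u, v = normalize(u), normalize(v)
    return sum(a * b for a, b in zip(u, v))

def to_document_space(term_weights, corpus_docs):
    """Multiply a term-weight vector by the term-document matrix.

    corpus_docs holds one term-weight vector per training document. Each is
    normalized first, so every output entry is proportional to the cosine
    similarity between the input and that training document.
    """
    return [sum(w * d for w, d in zip(term_weights, normalize(doc)))
            for doc in corpus_docs]

# Hypothetical training corpus of three aligned document pairs.
# English vocabulary: [house, dog]; Spanish vocabulary: [casa, perro].
ENGLISH_DOCS = [[2, 0], [0, 3], [1, 1]]
SPANISH_DOCS = [[1, 0], [0, 2], [2, 2]]

english_query = [1, 0]   # "house"
spanish_doc_a = [3, 0]   # new document about "casa"
spanish_doc_b = [0, 3]   # new document about "perro"

q = to_document_space(english_query, ENGLISH_DOCS)
sim_a = cosine(q, to_document_space(spanish_doc_a, SPANISH_DOCS))
sim_b = cosine(q, to_document_space(spanish_doc_b, SPANISH_DOCS))
print(sim_a > sim_b)
```

The query and the new documents never share a vocabulary. They become comparable only because both are described by their similarity to the same three training pairs.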

27
Latent Semantic Indexing
  • Designed for better monolingual effectiveness
  • Works well across languages too
  • Cross-language is just a type of term choice
    variation
  • Produces short dense document vectors
  • Better than long sparse ones for adaptive
    filtering
  • Training data needs grow with dimensionality
  • Not as good for retrieval efficiency
  • Always 300 multiplications, even for short queries

28
Sentence-Aligned Parallel Corpora
  • Easily constructed from aligned documents
  • Match pattern of relative sentence lengths
  • Not yet used directly for effective retrieval
  • But all experiments have included domain shift
  • Good first step for term alignment
  • Sentences define a natural context
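
A simplified sketch of length-based sentence alignment using dynamic programming. The cost (absolute length difference after rescaling) and the fixed `skip_penalty` are assumptions chosen for brevity; they are not the statistical models used in published aligners.

```python
def align_by_length(src_lens, tgt_lens, skip_penalty=50):
    """Align sentences by matching the pattern of relative sentence lengths.

    Allowed alignment units: 1-1, 1-0, 0-1, 2-1, 1-2.
    Returns a list of (source indices, target indices) tuples.
    """
    inf = float("inf")
    n, m = len(src_lens), len(tgt_lens)
    # Rescale target lengths so both sides have the same total length.
    scale = sum(src_lens) / sum(tgt_lens) if sum(tgt_lens) else 1.0
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    units = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in units:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s = sum(src_lens[i:ni])
                t = sum(tgt_lens[j:nj]) * scale
                step = abs(s - t) + (skip_penalty if 0 in (di, dj) else 0)
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)
    # Walk the back-pointers from the end to recover the alignment.
    alignment, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        alignment.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return alignment[::-1]

# Hypothetical sentence lengths in characters; source sentence 1 was split in two.
alignment = align_by_length([20, 45, 30], [22, 15, 28, 31])
print(alignment)
```

Each aligned unit then supplies the "natural context" for term alignment: only terms inside the same unit are considered as candidate translations.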

29
Cooccurrence-Based Translation
  • Align terms using cooccurrence statistics
  • How often does a term pair occur in sentence pairs?
  • Weighted by relative position in the sentences
  • Retain term pairs that occur unusually often
  • Useful for query translation
  • Excellent results when the domain is the same
  • Also practical for document translation
  • Term usage reinforces good translations
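
A sketch of term alignment from sentence pairs, assuming the Dice coefficient as the measure of "unusually often". The slide does not name a statistic, and the positional weighting it mentions is omitted here. The sentence pairs are hypothetical.

```python
from collections import Counter

# Hypothetical sentence-aligned pairs (English, Spanish).
SENTENCE_PAIRS = [
    ("the house is big", "la casa es grande"),
    ("the cat is big", "la gata es grande"),
    ("the house is red", "la casa es roja"),
]

def dice_scores(pairs):
    """Score each (source, target) term pair over the sentence pairs.

    Dice = 2 * cooccurrences / (source term count + target term count),
    with each term counted at most once per sentence.
    """
    src_count, tgt_count, both = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_terms, tgt_terms = set(src.split()), set(tgt.split())
        src_count.update(src_terms)
        tgt_count.update(tgt_terms)
        both.update((s, t) for s in src_terms for t in tgt_terms)
    return {pair: 2 * n / (src_count[pair[0]] + tgt_count[pair[1]])
            for pair, n in both.items()}

def best_translation(term, scores):
    """Return the target term with the highest score for a source term."""
    candidates = {t: v for (s, t), v in scores.items() if s == term}
    return max(candidates, key=candidates.get) if candidates else None

scores = dice_scores(SENTENCE_PAIRS)
print(best_translation("house", scores))
```

Frequent function words such as "la" and "es" cooccur with everything, so their scores stay below those of true translation pairs. This is the effect the "unusually often" filter relies on.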

30
Exploiting Unaligned Corpora
  • Documents about the same set of subjects
  • No known relationship between document pairs
  • Easily available in many applications
  • Two approaches
  • Use a dictionary for rough translation
  • But refine it using the unaligned bilingual
    corpus
  • Use a dictionary to find alignments in the corpus
  • Then extract translation knowledge from the
    alignments

31
CLIR Evaluation Resources
  • Electronic texts
  • Text Retrieval Conference (E, F, G, I)
  • Topic Detection and Tracking (E, C)
  • Document images
  • No evaluation programs yet
  • Recorded speech
  • Topic Detection and Tracking (E, C)
  • Sign language
  • No evaluation programs yet
  • CLEF Evaluation
  • http://clef.iei.pi.cnr.it:2002/
