Information retrieval: overview

1
Information retrieval: overview
2
Information Retrieval and Text Processing
  • Huge literature dating back to the 1950s!
  • SIGIR/TREC -- home for much of this
  • Readings
  • Salton, Wong, Yang, "A Vector Space Model for
    Automatic Indexing," CACM Nov. 75, V18 N11
  • Turtle, Croft, "Inference Networks for Document
    Retrieval," SIGIR '90, OPTIONAL

3
IR/TP applications
  • Search
  • Filtering
  • Summarization
  • Classification
  • Clustering
  • Information extraction
  • Knowledge management
  • Author identification
  • and more...

4
Types of search
  • Recall -- finding documents one knows exist,
    e.g., an old e-mail message or RFC
  • Discovery -- finding interesting documents
    given a high-level goal
  • Classic IR search is focused on discovery

5
Classic discovery problem
  • Corpus: fixed collection of documents, typically
    "nice" docs (e.g., NYT articles)
  • Problem: retrieve documents relevant to the user's
    information need

6
Classical search
  • Flow diagram: Task -> (Conception) -> Info Need ->
    (Formulation) -> Query -> (Search, over Corpus) ->
    Results; (Refinement) loops from Results back to Query
7
Definitions
  • Task: e.g., write a Web crawler
  • Information need: perception of documents needed
    to accomplish the task, e.g., Web specs
  • Query: sequence of characters given to a search
    engine that one hopes will return desired documents

8
Conception
  • Translating task into information need
  • Mis-conception: identifying too little (tips on
    high-bandwidth DNS lookups) and/or too much (TCP
    spec) as relevant to the task
  • Sometimes a little extra breadth in results can
    tip the user off to the need to refine the info need,
    but there's not much research into handling this
    automatically

9
Translation
  • Translating info need into the query syntax of a
    particular search engine
  • Mis-translation: getting this wrong
  • Operator error (does the query "a b" mean
    "a AND b" or "a OR b"?)
  • Polysemy -- same word, different meanings
  • Synonymy -- different words, same meaning
  • Automation: NLP, easy syntax, query
    expansion, QA

10
Refinement
  • Modification of the query, typically in light of
    particular results, to better meet the info need
  • Lots of work on refining queries automatically
    (often with some input from the user, e.g.,
    relevance feedback)

11
Precision and recall
  • Classic metrics of search-result goodness
  • Recall: fraction of all good docs retrieved
    = relevant results / all relevant docs in corpus
  • Precision: fraction of results that are good
    = relevant results / result-set size
    (sketch of both below)
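
A minimal Python sketch of both metrics (the doc ids are made
up for illustration):

    def precision_recall(results: set, relevant: set):
        """Precision and recall for one query's result set."""
        hits = results & relevant              # the relevant results
        precision = len(hits) / len(results)   # fraction of results that are good
        recall = len(hits) / len(relevant)     # fraction of all good docs retrieved
        return precision, recall

    # 3 of 4 returned docs are relevant; the corpus holds 6 relevant docs.
    p, r = precision_recall({1, 2, 3, 4}, {2, 3, 4, 7, 8, 9})
    print(p, r)  # 0.75 0.5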

12
Precision and recall
  • Recall/precision trade-off
  • Return everything => great recall, bad precision
  • Return nothing => great precision, bad recall
  • Precision curves
  • Search engine produces a total ranking
  • Plot precision at 10%, 20%, ..., 100% recall
    (curve sketch below)
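
A sketch of computing those curve points from a total ranking,
assuming relevance judgments are given (ranking and judgments
here are made up):

    def precision_curve(ranking, relevant):
        """Precision each time recall first reaches 10%, 20%, ..., 100%."""
        points, hits = [], 0
        targets = [i / 10 for i in range(1, 11)]   # recall levels 0.1 .. 1.0
        for rank, doc in enumerate(ranking, start=1):
            if doc in relevant:
                hits += 1
            recall, precision = hits / len(relevant), hits / rank
            while targets and recall >= targets[0]:
                points.append((targets.pop(0), precision))
        return points

    print(precision_curve([7, 1, 8, 2, 9], {7, 8, 9}))
    # [(0.1, 1.0), (0.2, 1.0), (0.3, 1.0), (0.4, 0.666...), ..., (1.0, 0.6)]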

13
Other metrics
  • Novelty / anti-redundancy
  • Information content of result set is disjoint
  • Comprehensible
  • Returned documents can be understood by user
  • Accurate / authoritative
  • Citation ranking!!
  • Freshness

14
Classic search techniques
  • Boolean
  • Ranked Boolean (toy sketch of Boolean retrieval
    after this list)
  • Vector space
  • Probabilistic / Bayesian
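
To make the first technique concrete, a toy sketch of Boolean
retrieval over an inverted index (postings and doc ids are
invented for illustration):

    # postings: term -> set of ids of docs containing the term
    postings = {
        "crawler": {1, 3},
        "web":     {1, 2, 3},
        "spec":    {2},
    }

    def boolean_and(*terms):
        """Docs containing every query term (Boolean AND)."""
        sets = [postings.get(t, set()) for t in terms]
        return set.intersection(*sets) if sets else set()

    print(boolean_and("web", "crawler"))  # {1, 3}

Ranked Boolean then orders the matching docs, e.g., by how many
of the query terms each one contains.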

15
Term vector basics
  • Basic abstraction for information retrieval
  • Useful for measuring semantic similarity of text
  • A row in the example table (a doc-by-term matrix
    shown on the slide, not in this transcript) is a term vector
  • Columns are word stems and phrases
  • Trying to capture meaning

16
Everything's a vector!
  • Documents are vectors
  • Document collections are vectors
  • Queries are vectors
  • Topics are vectors

17
Cosine measurement of similarity
  • cos(E1, E2) = (E1 . E2) / (|E1| |E2|)
  • Rank docs against Qs, measure similarity of
    docs, etc.
  • In the example (term-vector table on the slide;
    code sketch below):
  • cos(doc1, doc2) = 1/3
  • cos(doc1, doc3) = 2/3
  • cos(doc2, doc3) = 1/2
  • So docs 1 and 3 are closest
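
A small Python sketch of the measure; the vectors are
illustrative, since the slide's table is not in the transcript:

    import math

    def cosine(e1, e2):
        """cos(E1, E2) = (E1 . E2) / (|E1| |E2|)."""
        dot = sum(a * b for a, b in zip(e1, e2))
        return dot / (math.sqrt(sum(a * a for a in e1)) *
                      math.sqrt(sum(b * b for b in e2)))

    # Two binary term vectors of three terms each, sharing one term:
    print(cosine([1, 1, 1, 0, 0], [0, 0, 1, 1, 1]))  # 1/3 = 0.333...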

18
Weighting of terms in vectors
  • Salton's TF x IDF
  • TF = term frequency in document
  • DF = doc frequency of term (# docs with term)
  • IDF = inverse doc freq. = 1/DF
  • Weight of term = TF x IDF
  • Importance of term determined by
  • Count of term in doc (high => important)
  • Number of docs with term (low => important)
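
A minimal sketch of this weighting, taking the slide's IDF
literally as 1/DF (log(N/DF) is a common variant); the corpus
is made up:

    docs = [
        ["web", "crawler", "web", "spec"],
        ["web", "search"],
        ["crawler", "robots"],
    ]

    def tfidf(term, doc, docs):
        tf = doc.count(term)               # count of term in doc
        df = sum(term in d for d in docs)  # # docs with term
        return tf * (1 / df)               # weight = TF x IDF

    print(tfidf("web", docs[0], docs))      # 2 * 1/2 = 1.0
    print(tfidf("crawler", docs[0], docs))  # 1 * 1/2 = 0.5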

19
Relevance-feedback in VSM
  • Rocchio formula
  • Q' = F(Q, Relevant, Irrelevant)
  • where F is a weighted sum, such as
  • Q't = a*Qt + b*sum_i R_i,t - c*sum_i I_i,t
    (sketch below)
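
A sketch of that update in Python (the a, b, c weights and the
vectors are illustrative; irrelevant-doc terms are subtracted,
per standard Rocchio, and negative weights clipped to zero):

    def rocchio(q, rel, irrel, a=1.0, b=0.75, c=0.25):
        """Q't = a*Qt + b*sum_i R_i,t - c*sum_i I_i,t, clipped at zero."""
        terms = set(q)
        for d in rel + irrel:
            terms |= set(d)
        return {t: max(0.0, a * q.get(t, 0.0)
                            + b * sum(d.get(t, 0.0) for d in rel)
                            - c * sum(d.get(t, 0.0) for d in irrel))
                for t in terms}

    q = {"web": 1.0, "crawler": 1.0}
    print(rocchio(q, [{"web": 1.0, "spec": 1.0}],
                     [{"crawler": 1.0, "dns": 1.0}]))
    # web 1.75, crawler 0.75, spec 0.75, dns 0.0 (key order may vary)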

20
Remarks on VSM
  • Principled way of solving many IR/text processing
    problems, not just search
  • Tons of variations on VSM
  • Different term weighting schemes
  • Different similarity formulas
  • Normalization itself is a huge sub-industry

21
All of this goes out the window on the Web
  • Very small, unrefined queries
  • Recall not an issue
  • Quality is the issue (want most relevant)
  • Precision-at-ten matters (how many total losers)
  • Scale precludes heavy VSM techniques
  • Corpus assumptions (e.g., unchanging, uniform
    quality) do not hold
  • Adversarial IR -- new challenge on the Web
  • Still, VSM is an important tool for Web Archeology