CS798: Information Retrieval - PowerPoint PPT Presentation

About This Presentation
Title:

CS798: Information Retrieval

Description:

Information retrieval is concerned with representing, searching, and ... Probabilistic retrieval and Okapi BM25F. Language modeling. Divergence from randomness ... – PowerPoint PPT presentation

Number of Views:150
Avg rating:3.0/5.0
Slides: 31
Provided by: plgUwa
Category:

less

Transcript and Presenter's Notes

Title: CS798: Information Retrieval


1
CS798 Information Retrieval
Charlie Clarke claclark_at_plg.uwaterloo.ca Inform
ation retrieval is concerned with representing,
searching, and manipulating large collections of
human-language data.
2
Housekeeping
Web page http//plg.uwaterloo.ca/claclark/cs798
Area Applications/Databases Meeting times
Mondays, 200-500, MC2036
3
NLP
DB
IR
ML
4
Topics
  • Basic techniques
  • Searching, browsing, ranking, retrieval
  • Indexing algorithms and data structures
  • Evaluation
  • Application areas

5
1. Basic Techniques
  • Text representation Tokenization
  • Inverted indices
  • Phrase searching example
  • Vector space model
  • Boolean retrieval
  • Simple proximity ranking
  • Test collections Evaluation

6
2. Retrieval and Ranking
  • Probabilistic retrieval and Okapi BM25F
  • Language modeling
  • Divergence from randomness
  • Passage retrieval
  • Classification
  • Learning to rank
  • Implicit user feedback

7
3. Indexing
  • Algorithms and data structures
  • Index creation
  • Dynamic update
  • Index compression
  • Query processing
  • Query optimization

8
4. Evaluation
  • Statistical foundations of evaluation
  • Measuring Efficiency
  • Measuring Effectiveness
  • Recall/Precision
  • NDCG
  • Other measures
  • Building a test collection

9
5. Application Areas
  • Parallel retrieval architectures
  • Web search (Link analysis/Pagerank)
  • XML retrieval
  • Filesystem search
  • Spam filtering

10
Other Topics (student projects)
  • Image/video/speech retrieval
  • Web spam
  • Cross- and multi-lingual IR
  • Clustering
  • Advertising/Recommendation
  • Distributed IR/Meta-search
  • Question answering
  • etc.

11
Resources
  • Textbook (partial draft on Website)
  • Büttcher, Clarke Cormack. Information
    Retrieval Data Structures, Algorithms and
    Evaluation.
  • (start reading ch. 1-3)
  • Wumpus
  • www.wumpus-search.org

12
Grading
  • Short homework exercises from text (10)
  • A literature review based on a topic area
    selected by the student with the agreement of the
    instructor (30)
  • 30-minute presentation on your selected topic
    (20)
  • Class project (40) details coming up..

13
(No Transcript)
14
Documents
  • Documents are the basic units of retrieval in an
    IR system.
  • In practice they might be Web pages, email
    messages, LaTeX files, news articles, phone
    message, etc.
  • Update add, delete, append(?), modify(?)
  • Passages and XML elements are other possible
    units of retrieval.

15
Probability Ranking Principle
  • If an IR systems response to a query is a
    ranking of the documents in the collection in
    order of decreasing probability of relevance, the
    overall effectiveness of the system to its users
    will be maximized.

16
Evaluating IR systems
  • Efficiency vs. effectiveness
  • Manual evaluation
  • Topic creation and judging
  • TREC (Text REtreival Conference)
  • Google Has 10,000 Human Evaluators?
  • Evaluation through implicit user feedback
  • Specificity vs. exhaustivity

17
lttopicgt lttitlegt shark attacks lt/titlegt ltdescgt Wh
ere do shark attacks occur in the
world? lt/descgt ltnarrgt Are there beaches or other
areas that are particularly prone to shark
attacks? Documents comparing areas and providing
statistics are relevant. Documents describing
shark attacks at a single location are not
relevant. lt/narrgt lt/topicgt
18
(No Transcript)
19
(No Transcript)
20
Class ProjectWikipedia Search
  • Can we outperform Google on the Wikipedia?
  • Basic project Build a search engine for the
    Wikipedia (using any tools you can find).
  • Ideas Pagerank, spelling, structure, element
    retrieval, summarization, external information,
    user interfaces

21
Class Project Evaluation
  • Each student will create and judge n topics.
  • The value of n depends on the number of students.
    (But workload stays the same.)
  • Quantitative measure of effectiveness.
  • Qualitative assessment of user interfaces.
  • Volunteer needed to operate the judging interface
    (for credit).

22
Class Project Organization
  • You may work in groups (check with me).
  • You may work individually (check with me).
  • You may create and share tools with other
    students. You get the credit. (e.g. Volunteer
    needed to set up a class wiki.)
  • Programming cant be avoided, but can be
    minimized. ?
  • Programming can also be maximized.

23
Class Project Grading
  • Topic creation and judging 10
  • Other project work 30
  • You are responsible for submitting one
    experimental run for evaluation.
  • Other activities are up to you.

24
(No Transcript)
25
One line?
26
Tokenization
  • For English text Treat each string of
    alphanumeric characters as a token.
  • Number sequentially from the start of the text
    collection.
  • For non-English text Depends on the language
    (possible student projects)
  • Other considerations Stemming, stopwords, etc.

27
(No Transcript)
28
Inverted Indices
  • Basic data structure
  • More next day

29
(No Transcript)
30
Plan
  • Sept 17
  • Inverted indices (from Chapter 3)
  • Index construction/Wumpus (Stefan)
  • Sept 24
  • Vector space model, Boolean retrieval, proximity
  • Basic evaluation methods
  • October 1
  • Probabilistic retrieval, language modeling
  • Start topic creation for class project
  • October 8 Web search
Write a Comment
User Comments (0)
About PowerShow.com