Title: CS798: Information Retrieval
1CS798 Information Retrieval
Charlie Clarke claclark_at_plg.uwaterloo.ca
Information retrieval is concerned with representing,
searching, and manipulating large collections of
human-language data.
2Housekeeping
Web page: http://plg.uwaterloo.ca/claclark/cs798
Area: Applications/Databases
Meeting times: Mondays, 2:00-5:00, MC2036
3NLP
DB
IR
ML
4Topics
- Basic techniques
- Searching, browsing, ranking, retrieval
- Indexing algorithms and data structures
- Evaluation
- Application areas
51. Basic Techniques
- Text representation & Tokenization
- Inverted indices
- Phrase searching example
- Vector space model
- Boolean retrieval
- Simple proximity ranking
- Test collections & Evaluation
62. Retrieval and Ranking
- Probabilistic retrieval and Okapi BM25F
- Language modeling
- Divergence from randomness
- Passage retrieval
- Classification
- Learning to rank
- Implicit user feedback
73. Indexing
- Algorithms and data structures
- Index creation
- Dynamic update
- Index compression
- Query processing
- Query optimization
84. Evaluation
- Statistical foundations of evaluation
- Measuring Efficiency
- Measuring Effectiveness
- Recall/Precision
- NDCG
- Other measures
- Building a test collection
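As a rough illustration of the effectiveness measures listed above, here is a minimal Python sketch of precision, recall, and NDCG at a cutoff k. The function names, the document ids, and the gain values are hypothetical. NDCG has several variants; this sketch assumes a log2(rank + 1) discount with graded gains used directly, which may differ from the formulation used in the course.

```python
import math

def precision_recall(ranking, relevant, k):
    """Precision and recall over the top k results of a ranking."""
    hits = sum(1 for doc in ranking[:k] if doc in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def dcg(gains):
    # Assumption: log2(rank + 1) discount, gains used as-is (one common variant).
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg(ranking, gain, k):
    """NDCG@k: DCG of the ranking divided by the DCG of an ideal ordering."""
    actual = dcg([gain.get(doc, 0) for doc in ranking[:k]])
    ideal = dcg(sorted(gain.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0

# Hypothetical judgments: d1 is highly relevant, d2 and d5 are somewhat relevant.
gain = {"d1": 2, "d2": 1, "d5": 1}
ranking = ["d3", "d1", "d7", "d2"]
p, r = precision_recall(ranking, set(gain), k=4)
score = ndcg(ranking, gain, k=4)
```

Precision and recall treat relevance as binary, while NDCG rewards placing the more relevant documents earlier in the ranking.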
95. Application Areas
- Parallel retrieval architectures
- Web search (Link analysis/PageRank)
- XML retrieval
- Filesystem search
- Spam filtering
10Other Topics (student projects)
- Image/video/speech retrieval
- Web spam
- Cross- and multi-lingual IR
- Clustering
- Advertising/Recommendation
- Distributed IR/Meta-search
- Question answering
- etc.
11Resources
- Textbook (partial draft on Website)
- Büttcher, Clarke & Cormack. Information Retrieval: Data Structures, Algorithms and Evaluation.
- (start reading ch. 1-3)
- Wumpus
- www.wumpus-search.org
12Grading
- Short homework exercises from text (10%)
- A literature review based on a topic area selected by the student with the agreement of the instructor (30%)
- 30-minute presentation on your selected topic (20%)
- Class project (40%); details coming up.
14Documents
- Documents are the basic units of retrieval in an IR system.
- In practice they might be Web pages, email messages, LaTeX files, news articles, phone messages, etc.
- Update: add, delete, append(?), modify(?)
- Passages and XML elements are other possible units of retrieval.
15Probability Ranking Principle
- If an IR system's response to a query is a
ranking of the documents in the collection in
order of decreasing probability of relevance, the
overall effectiveness of the system to its users
will be maximized.
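A minimal sketch of what the principle asks of a system, assuming we already have an estimated probability of relevance for each document. The probabilities and document ids below are made up for illustration; estimating such probabilities is the hard part and is not shown. The helper `expected_relevant_in_top_k` is a hypothetical name used only to show that the probability-ordered ranking puts the largest expected number of relevant documents in the top k.

```python
def rank_by_probability(prob_relevant):
    """Return document ids ordered by decreasing estimated probability of relevance."""
    # Ties are broken by document id so the ordering is deterministic.
    return sorted(prob_relevant, key=lambda doc: (-prob_relevant[doc], doc))

def expected_relevant_in_top_k(ranking, prob_relevant, k):
    """Expected number of relevant documents among the first k results."""
    return sum(prob_relevant[doc] for doc in ranking[:k])

# Hypothetical estimates for a single query.
prob_relevant = {"d1": 0.2, "d2": 0.9, "d3": 0.5}
ranking = rank_by_probability(prob_relevant)
top2 = expected_relevant_in_top_k(ranking, prob_relevant, 2)
```

Any other ordering of the same three documents yields an expected count in the top 2 that is no larger than the one obtained here.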
16Evaluating IR systems
- Efficiency vs. effectiveness
- Manual evaluation
- Topic creation and judging
- TREC (Text REtrieval Conference)
- Google Has 10,000 Human Evaluators?
- Evaluation through implicit user feedback
- Specificity vs. exhaustivity
17
<topic>
<title> shark attacks </title>
<desc> Where do shark attacks occur in the world? </desc>
<narr> Are there beaches or other areas that are particularly prone to shark attacks? Documents comparing areas and providing statistics are relevant. Documents describing shark attacks at a single location are not relevant. </narr>
</topic>
20Class Project: Wikipedia Search
- Can we outperform Google on the Wikipedia?
- Basic project: Build a search engine for the Wikipedia (using any tools you can find).
- Ideas: PageRank, spelling, structure, element retrieval, summarization, external information, user interfaces
21Class Project Evaluation
- Each student will create and judge n topics.
- The value of n depends on the number of students. (But workload stays the same.)
- Quantitative measure of effectiveness.
- Qualitative assessment of user interfaces.
- Volunteer needed to operate the judging interface
(for credit).
22Class Project Organization
- You may work in groups (check with me).
- You may work individually (check with me).
- You may create and share tools with other students. You get the credit. (e.g. Volunteer needed to set up a class wiki.)
- Programming can't be avoided, but can be minimized.
- Programming can also be maximized.
23Class Project Grading
- Topic creation and judging: 10%
- Other project work: 30%
- You are responsible for submitting one experimental run for evaluation.
- Other activities are up to you.
25One line?
26Tokenization
- For English text: Treat each string of alphanumeric characters as a token.
- Number sequentially from the start of the text collection.
- For non-English text: Depends on the language (possible student projects)
- Other considerations: Stemming, stopwords, etc.
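The English-text rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: "alphanumeric" is taken to mean ASCII letters and digits, tokens are folded to lower case, and numbering starts at 1. None of these choices is specified on the slide, and `tokenize_collection` is a hypothetical name.

```python
import re

# Assumption: a token is a maximal run of ASCII letters and digits.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+")

def tokenize_collection(documents):
    """Yield (position, token) pairs, numbering tokens across the whole collection."""
    position = 0
    for text in documents:
        for match in TOKEN_PATTERN.finditer(text):
            position += 1  # numbering continues from one document to the next
            yield position, match.group().lower()  # case folding is an assumption

tokens = list(tokenize_collection(["Shark attacks!", "It's 2007."]))
```

Note that punctuation splits tokens, so "It's" becomes the two tokens "it" and "s". Stemming and stopword removal are not applied here.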
28Inverted Indices
- Basic data structure
- More next day
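As a hedged preview of the basic data structure, the sketch below builds a positional inverted index that maps each term to the list of collection-wide token positions where it occurs, then uses it for a simple phrase search. The function names and the tokenization rule (lower-cased runs of ASCII letters and digits, numbered from 1) are assumptions for illustration, and this is not how Wumpus or any particular system implements its index.

```python
import re
from collections import defaultdict

TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+")  # assumed tokenization rule

def build_index(documents):
    """Map each term to the increasing list of positions where it occurs."""
    index = defaultdict(list)
    position = 0
    for text in documents:
        for match in TOKEN_PATTERN.finditer(text):
            position += 1
            index[match.group().lower()].append(position)
    return index

def phrase_positions(index, phrase):
    """Return start positions where the terms of `phrase` occur consecutively."""
    terms = [t.lower() for t in TOKEN_PATTERN.findall(phrase)]
    if not terms:
        return []
    postings = [set(index.get(term, ())) for term in terms]
    return [p for p in sorted(postings[0])
            if all(p + i in postings[i] for i in range(1, len(terms)))]

docs = ["shark attacks occur", "attacks by a shark", "shark attacks"]
index = build_index(docs)
starts = phrase_positions(index, "shark attacks")
```

Because positions run across the whole collection, this sketch would also match a phrase that straddles a document boundary; a fuller implementation would record where each document starts and ends.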
30Plan
- Sept 17
- Inverted indices (from Chapter 3)
- Index construction/Wumpus (Stefan)
- Sept 24
- Vector space model, Boolean retrieval, proximity
- Basic evaluation methods
- October 1
- Probabilistic retrieval, language modeling
- Start topic creation for class project
- October 8
- Web search