Title: Modern Information Retrieval Chapter 1: Introduction
1Modern Information Retrieval, Chapter 1: Introduction
- Ricardo Baeza-Yates 
- Berthier Ribeiro-Neto
2Motivation
- Example of a user information need 
- Topic: NCAA college tennis team 
- Description: Find all the pages (documents) 
 containing information on college tennis teams
 which (1) are maintained by a university in the
 USA and (2) participate in the NCAA tennis
 tournament.
- Narrative: To be relevant, the page must include 
 information on the national ranking of the team
 in the last three years and the email or phone
 number of the team coach.
3IR Research
- Information retrieval vs Data retrieval 
- Research 
- information search 
- information filtering (routing) 
- document classification and categorization 
- user interfaces and data visualization 
- cross-language retrieval
4IR History
5The User Task
- Retrieval (Searching) 
- classic information search process where clear 
 objectives are defined
- Browsing 
- a process where one's main objectives are not 
 clearly defined and might change during the
 interaction with the system
6Logical View of the Documents
- Text Operations 
- reduce the complexity of the document 
 representation
- a full text -> a set of index terms 
- Steps 
- 1. Stopword removal 
- 2. Stemming 
- 3. Noun groups 
- 4. ...
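The first two steps above can be sketched in a few lines of Python. This is only an illustration: the stopword list and the suffix rules in `crude_stem` are assumptions made up for the example, not the rules of any real stemmer, and the function names are hypothetical.

```python
import re

# Hypothetical stopword list, chosen only for illustration.
STOPWORDS = {"a", "an", "and", "in", "of", "on", "the", "to", "which"}

def crude_stem(word):
    """Strip a trailing 'ing' or plural 's'; a stand-in for a real stemmer."""
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]
    if (word.endswith("s") and len(word) > 3
            and not word.endswith(("ss", "is", "us"))):
        return word[:-1]
    return word

def index_terms(text):
    """Full text -> set of index terms (stopword removal, then stemming)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {crude_stem(t) for t in tokens if t not in STOPWORDS}

terms = index_terms("Pages containing information on college tennis teams")
print(sorted(terms))
```

The output is a set rather than a list because this logical view keeps only which terms occur, not their order or frequency.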
7Past, Present, and Future
- Early Development 
- Index 
- Library 
- Author name, title, subject headings, keywords 
- The Web and Digital Libraries 
- Hyperlinks
8Conventional Text-Retrieval Systems
- Automatic Text Processing, G. Salton, 
 Addison-Wesley, 1989 (Chapter 9)
9Data Retrieval
- A specified set of attributes is used to 
 characterize each record.
 EMPLOYEE(NAME, SSN, BDATE, ADDR, SEX, SALARY, DNO)
- Exact match between the attributes used in query 
 formulations and those attached to the document.
 SELECT BDATE, ADDR FROM EMPLOYEE WHERE NAME = 'John Smith'
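The exact-match behaviour of data retrieval can be tried with Python's built-in `sqlite3` module. This is a sketch under stated assumptions: the column types and both rows are invented for illustration, and only the table and column names come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE EMPLOYEE (NAME TEXT, SSN TEXT, BDATE TEXT, ADDR TEXT, "
    "SEX TEXT, SALARY REAL, DNO INTEGER)"
)
# Made-up rows, for illustration only.
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("John Smith", "111", "1965-01-09", "1 Main St", "M", 30000, 5),
        ("Joan Smith", "222", "1970-03-12", "2 Elm St", "F", 32000, 4),
    ],
)

# Exact match: only the record whose NAME equals the query value is returned.
rows = conn.execute(
    "SELECT BDATE, ADDR FROM EMPLOYEE WHERE NAME = ?", ("John Smith",)
).fetchall()
print(rows)
```

A record that is merely similar ("Joan Smith") is not returned at all, which is the contrast with text retrieval drawn on the next slide.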
10Text-Retrieval Systems
- Content identifiers (keywords, index terms, 
 descriptors) characterize the stored texts.
- Retrieval depends on the degree of coincidence 
 between the sets of identifiers attached to
 queries and documents
- Figure labels: content analysis, query formulation 
 11Possible Representation
- Document representation (Text operation) 
- unweighted index terms (term vectors) 
- weighted index terms 
- Query (Query operation) 
- unweighted or weighted index terms 
- Boolean combinations (or, and, not) 
- Search operation must be effective 
- (Indexing) 
12File Structures
- Main requirements 
- fast-access for various kinds of searches 
- large number of indices 
- Alternatives 
- Inverted Files 
- Signature Files 
- PAT trees 
13Inverted Files
- File is represented as an array of indexed 
 documents.
14Inverted-file process
- The document-term array is inverted (transposed).
15Inverted-file process (Continued)
- Take two or more rows of an inverted 
 term-document array, and produce a single
 combined list of document identifiers.
- Example query: (term2 and term3) 
   term2   1  1  0  0
   term3   0  1  1  1
   ------------------
              1        <-- D2
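A minimal Python sketch of the inversion and of the AND query in the example. The 0/1 values for term2 and term3 are the ones shown in the example; the term1 column and the document names D1 to D4 are assumptions added so that the array has something to transpose.

```python
# Document-term array: one row per document, one column per term.
terms = ["term1", "term2", "term3"]
doc_ids = ["D1", "D2", "D3", "D4"]
doc_term = [
    [1, 1, 0],  # D1
    [0, 1, 1],  # D2
    [1, 0, 1],  # D3
    [0, 0, 1],  # D4
]

# Inversion is a transpose: one row per term, one column per document.
term_doc = {t: list(col) for t, col in zip(terms, zip(*doc_term))}

def and_query(a, b):
    """Combine two term rows with a bitwise AND and list the matching documents."""
    combined = [x & y for x, y in zip(term_doc[a], term_doc[b])]
    return [d for d, bit in zip(doc_ids, combined) if bit]

result = and_query("term2", "term3")
print(term_doc["term2"], term_doc["term3"], result)
```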
16List-merging for two ordered lists
- The inverted-index operations to obtain answers 
 are based on a list-merging process.
- Example 
 T1: D1, D3
 T2: D1, D2
 Merged(T1, T2): D1, D1, D2, D3
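The merge in the example can be sketched as a standard two-pointer merge of ordered lists. Reading the answers off the merged list (duplicates for AND, distinct entries for OR) is one common interpretation and is an assumption here, as are the function names; documents are represented by their numbers, and each input list is assumed to contain no repeats.

```python
def merge(a, b):
    """Merge two ordered posting lists, keeping duplicates."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out

def and_from_merged(merged):
    """An identifier repeated in adjacent positions occurs in both lists."""
    return [x for x, y in zip(merged, merged[1:]) if x == y]

def or_from_merged(merged):
    """Collapse adjacent duplicates to get every identifier once."""
    out = []
    for x in merged:
        if not out or out[-1] != x:
            out.append(x)
    return out

merged = merge([1, 3], [1, 2])  # T1: D1, D3 and T2: D1, D2
print(merged, and_from_merged(merged), or_from_merged(merged))
```

Because both inputs are already ordered, the merge touches each posting once, so the cost grows linearly with the combined list length.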
17Extensions of Inverted Index Operations (Distance Constraints)
- Distance Constraints 
- (A within sentence B): terms A and B must co-occur 
 in a common sentence
- (A adjacent B): terms A and B must occur 
 adjacently in the text
18Extensions of Inverted Index Operations (Distance Constraints)
- Implementation 
- include term-location in the inverted indexes 
 information: P345, P348, P350, ...
 retrieval: P123, P128, P345, ...
- include sentence-location in the indexes 
 information: (P345, 25), (P345, 37), (P348, 10), (P350, 8)
 retrieval: (P123, 5), (P128, 25), (P345, 37), (P345, 40)
19Extensions of Inverted Index Operations (Distance Constraints)
- Include in the indexes: paragraph numbers, sentence 
 numbers within paragraphs, and word numbers within
 sentences
 information: (P345, 2, 3, 5)
 retrieval: (P345, 2, 3, 6)
- Query examples 
 (information adjacent retrieval)
 (information within five words retrieval)
- Cost: the size of the indexes
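A sketch of how the two query examples could be evaluated against such an index. The two postings are the ones given for document P345; the dictionary layout, the function names, and the choice to treat adjacency as order-independent and limited to a single sentence are assumptions.

```python
# Postings: term -> list of (document, paragraph, sentence, word) tuples.
index = {
    "information": [("P345", 2, 3, 5)],
    "retrieval": [("P345", 2, 3, 6)],
}

def within_words(a, b, k):
    """Documents where a and b share a sentence and lie at most k words apart."""
    hits = set()
    for doc_a, par_a, sen_a, word_a in index.get(a, []):
        for doc_b, par_b, sen_b, word_b in index.get(b, []):
            same_sentence = (doc_a, par_a, sen_a) == (doc_b, par_b, sen_b)
            if same_sentence and abs(word_a - word_b) <= k:
                hits.add(doc_a)
    return sorted(hits)

def adjacent(a, b):
    """Adjacency is the special case of a one-word distance."""
    return within_words(a, b, 1)

print(adjacent("information", "retrieval"))
print(within_words("information", "retrieval", 5))
```

Every posting now carries three extra numbers, which is the index-size cost noted above.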
20Retrieval models
- Classic Models: Boolean, Vector, Probabilistic 
- Set Theoretic: Fuzzy, Extended Boolean 
- Algebraic: Generalized Vector, Latent Semantic Index, 
 Neural Networks
- Probabilistic: Inference Network, Belief Network 
21Classic IR Model
- Basic concepts: each document is described by a 
 set of representative keywords called index
 terms.
- Numerical weights are assigned to index terms to 
 capture their differing relevance for describing
 a document.
22Boolean model
- Binary decision criterion 
- Data retrieval model 
- Advantage 
- clean formalism, simplicity 
- Disadvantage 
- It is not simple to translate an information need 
 into a Boolean expression.
- exact matching may lead to retrieval of too few 
 or too many documents
23Vector model
- Assign non-binary weights to index terms in 
 queries and in documents. -> TFxIDF
- Compute the similarity between documents and 
 query. -> Sim(Dj, Q)
- More precise than the Boolean model.
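The slide names TFxIDF and Sim(Dj, Q) without giving formulas. The sketch below assumes one common variant: raw term frequency times log(N/df) for the weights, and cosine similarity for Sim. The toy collection and the function names are made up for illustration.

```python
import math
from collections import Counter

docs = {  # made-up toy collection
    "D1": "college tennis team ranking",
    "D2": "college football team",
    "D3": "tennis racket review",
}

def tfidf_vectors(texts):
    """Return per-document {term: tf * idf} vectors and the idf table."""
    tokenized = {d: t.split() for d, t in texts.items()}
    n = len(tokenized)
    df = Counter(term for toks in tokenized.values() for term in set(toks))
    idf = {term: math.log(n / count) for term, count in df.items()}
    vectors = {
        d: {term: tf * idf[term] for term, tf in Counter(toks).items()}
        for d, toks in tokenized.items()
    }
    return vectors, idf

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

vectors, idf = tfidf_vectors(docs)
query = {t: idf.get(t, 0.0) for t in "college tennis".split()}
ranking = sorted(docs, key=lambda d: cosine(vectors[d], query), reverse=True)
print(ranking)
```

Unlike the Boolean model, every document receives a graded score, so documents that match only part of the query are ranked lower rather than dropped.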
24Term Weights
- Term weights: Di = (Ti1, 0.2; Ti2, 0.5; Ti3, 0.6) 
- Issues 
- How to generate the term weights? 
- How to apply the term weights? 
- Sum the weights of all document terms that match 
 the given query.
- Rank the output documents in descending order 
 of the summed weight.
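A short sketch of this sum-and-rank scheme. The weights 0.2, 0.5, and 0.6 are the ones listed for Di; the second document, the names D1 and D2, the term labels T1 to T3, and the query are assumptions added so that there is something to rank.

```python
docs = {
    "D1": {"T1": 0.2, "T2": 0.5, "T3": 0.6},
    "D2": {"T1": 0.7, "T2": 0.2, "T3": 0.1},  # assumed second document
}

def score(doc_weights, query_terms):
    """Sum the weights of the document terms that appear in the query."""
    return sum(w for t, w in doc_weights.items() if t in query_terms)

def rank(query_terms):
    """Return (document, score) pairs in descending order of score."""
    scored = [(d, score(w, query_terms)) for d, w in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranked = rank({"T2", "T3"})
print(ranked)
```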
25Boolean Query with Term Weights
- Transform a Boolean expression into disjunctive 
 normal form. T1 and (T2 or T3) = (T1 and T2)
 or (T1 and T3)
- For each conjunct, compute the minimum term 
 weight of any document term in that conjunct.
- The document weight is the maximum of all the 
 conjunct weights.
26Boolean Query with Term Weights
- Example: Q = (T1 and T2) or T3 
   Document vectors                  Conjunct weights      Query weight
                                     (T1 and T2)   (T3)    (T1 and T2) or T3
   D1 (T1, 0.2; T2, 0.5; T3, 0.6)    0.2           0.6     0.6
   D2 (T1, 0.7; T2, 0.2; T3, 0.1)    0.2           0.1     0.2
- D1 is preferred.
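The min/max rule can be checked against this example with a few lines of Python. The weights and the query are taken from the example; representing the disjunctive normal form as a list of term lists is an assumption, as are the function names.

```python
docs = {
    "D1": {"T1": 0.2, "T2": 0.5, "T3": 0.6},
    "D2": {"T1": 0.7, "T2": 0.2, "T3": 0.1},
}

# Q = (T1 and T2) or T3, already in disjunctive normal form:
# the outer list is the "or", each inner list is one conjunct.
query_dnf = [["T1", "T2"], ["T3"]]

def conjunct_weight(weights, conjunct):
    """Minimum weight over the terms of one conjunct (0.0 if a term is absent)."""
    return min(weights.get(t, 0.0) for t in conjunct)

def document_weight(weights, dnf):
    """Maximum over all conjunct weights."""
    return max(conjunct_weight(weights, c) for c in dnf)

scores = {d: document_weight(w, query_dnf) for d, w in docs.items()}
best = max(scores, key=scores.get)
print(scores, best)
```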
27Summary
- Conventional IR systems 
- Evaluation 
- Text operations (Term selection) 
- Query operations (Pattern matching, Relevance 
 feedback)
- Indexing (File structure) 
- Modeling 
28Resources
- Journals 
- Journal of the American Society for Information 
 Science
- ACM Transactions on Information Systems 
- Information Processing and Management 
- Information Systems (Elsevier) 
- Knowledge and Information Systems (Springer) 
- Conferences 
- ACM SIGIR, DL, CIKM, CHI, etc. 
- Text Retrieval Conference (TREC)