Title: Search Engines
1Search Engines
- Hadi Amiri
- Database Research Group
- ECE Department, University of Tehran
- email h.amiri_at_ece.ut.ac.ir
- web khorshid.ut.ac.ir/h.amiri
2Outline
- Part 1. Web
- Information Retrieval
- Evaluation Method
- Web Characteristics
- Search Engines
- Search Engine Query Logs
- Lacks and Federated Search
- Part 2. Open Source Search Engines
- Lemur and Indri
- Lemur Analysis
4Information Retrieval Systems
- Representation (LV), Storage, Organization and Access to Information Items
- Retrieval Engine
5Inside The IR Black Box
Documents
Query
Representation Function
Representation Function
Query Representation
Document Representation
Index
Comparison Function
Hits
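The boxes in this diagram can be sketched in a few lines of Python. This is only a toy illustration: the whitespace tokenizer, the set-based inverted index, and the "count matched terms" comparison function are simplifying assumptions, not how any particular engine works.

```python
from collections import defaultdict


def represent(text):
    """Representation function: lowercase the text and split it into terms."""
    return text.lower().split()


def build_index(documents):
    """Index: map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in represent(text):
            index[term].add(doc_id)
    return index


def compare(query_terms, index):
    """Comparison function: score documents by the number of distinct query terms matched."""
    scores = defaultdict(int)
    for term in set(query_terms):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties are broken by document ID for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical three-document collection.
documents = {
    "d1": "open source search engines",
    "d2": "web search engine query logs",
    "d3": "lemurs live in Madagascar",
}
index = build_index(documents)
hits = compare(represent("search engines"), index)
print(hits)
```

Note that "engine" in d2 does not match the query term "engines"; closing that gap is the job of text operations such as stemming.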
7Text Operations
8The Retrieval Process in Detail
9The Central Problem in IR
Authors
Information Seeker
Concepts
Concepts
Query Terms
Document Terms
Do these represent the same concepts?
10IR Evaluation
- Three components of a test collection
- Information Repository: collection of documents
- Queries: set of information needs
- Relevance Judgments: sets of documents that satisfy the information needs
11IR Evaluation Cont.
Query q
And some other performance measures derived from
them
12IR Evaluation Cont.
Example
No. of Relevant Documents is 5
13IR Evaluation Cont.
Interpolation
Using TREC_EVAL
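The measures on these evaluation slides can be sketched as follows, assuming the standard definitions: precision is the fraction of retrieved documents that are relevant, recall is the fraction of relevant documents that are retrieved, and interpolated precision at a recall level is the highest precision observed at that recall level or above. The ranking and the five relevant documents below are made up for illustration.

```python
def precision_recall_at_ranks(ranking, relevant):
    """Return a (recall, precision) pair after each rank of the ranking."""
    points = []
    found = 0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            found += 1
        points.append((found / len(relevant), found / rank))
    return points


def interpolated_precision(points, recall_level):
    """Highest precision at any point whose recall is >= recall_level (0.0 if none)."""
    candidates = [p for r, p in points if r >= recall_level]
    return max(candidates, default=0.0)


# Hypothetical ranked list for a query q with 5 relevant documents.
ranking = ["d3", "d9", "d1", "d7", "d5", "d2", "d8", "d4", "d6", "d10"]
relevant = {"d3", "d1", "d5", "d8", "d6"}

points = precision_recall_at_ranks(ranking, relevant)
for level in (0.0, 0.5, 1.0):
    print(level, round(interpolated_precision(points, level), 3))
```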
14Web Characteristics
- Distributed Data
- Visible and Invisible (Hidden) Web
- Volatile data: 40% / month
- Very large volume
- Very large answers
- 1998: 3,000,000 servers, 350,000,000 pages
- 2003: Google alone, 3,307,998,701 pages (10 times more)
- Unstructured and redundant data: 30% are duplicates
- Quality of data
- Heterogeneity
- Data (languages, alphabets, e.g. Chinese)
- Users (inexperienced)
15Search Engines
And Their Services?
16Search Engines
Indexed Documents
Search Engine
Web
Public Interface
D
Index
Target function
21Newest In Google
23Search Engine Query Logs
- Query Logs Contain
- Query
- Duration
- IP
- Clicks
- Date
- ...
Query Log
24Search Engine Query Logs
- Trend Detection
- Social Analysis
- Control Trends
- Term Analysis
- CTD Decrement
- ...
25Sponsored Search Results
26Term Distribution
Frequency in query-logs
Queries
A good idea is to find inexpensive or moderately priced keywords that are highly similar to the expensive keywords. Methods such as semantic clustering and text mining are applicable.
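A very rough sketch of this idea: rank cheaper keywords by their similarity to an expensive one. Token-overlap (Jaccard) similarity stands in here for the semantic clustering mentioned above, and the keywords, prices, and thresholds are all hypothetical.

```python
def jaccard(a, b):
    """Token-overlap similarity between two keyword phrases (0.0 to 1.0)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def cheaper_alternatives(expensive, prices, max_price, min_similarity=0.3):
    """Keywords priced at most max_price, ranked by similarity to the expensive keyword."""
    scored = [
        (keyword, jaccard(expensive, keyword))
        for keyword, price in prices.items()
        if keyword != expensive and price <= max_price
    ]
    kept = [(keyword, sim) for keyword, sim in scored if sim >= min_similarity]
    return sorted(kept, key=lambda item: (-item[1], item[0]))


# Hypothetical keyword prices (cost per click).
prices = {
    "car insurance": 12.0,
    "cheap car insurance quote": 3.5,
    "auto insurance": 9.0,
    "car insurance for students": 2.0,
    "used car parts": 1.0,
}
result = cheaper_alternatives("car insurance", prices, max_price=5.0)
print(result)
```

Token overlap misses synonyms such as "auto" and "car", which is exactly where semantic methods would do better.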
27Lacks In Search Engines
- Need to Locally Store Data (Documents)
- Distributed Data
- Visible and Invisible (Hidden) Web
- Hidden Web: information is hidden from conventional engines
- No arbitrary crawl of the data (e.g., ACM library)
- Updated too frequently to be crawled (e.g., buy.com)
- Larger than the Visible Web: 2-50 times (Sherman, 2001), 500 times (Bergman, 2001)
- Created by professionals
28Lacks and Shortages Cont.
Different Types of Hidden Information Sources
- Allow access to their contents via the source-specific search interface
- Allow their contents to be copied by conventional search engines
- The access to contents is subject to fee or subscription
29Lacks and Shortages Cont.
Some Examples
- US Patent and Trademark Office (USPTO) Database
- National Science Foundation's Award Database
- U.S. Government Printing Office (GPO) Portal
- National Institutes of Health's GenBank
1 http://www.uspto.gov/patft/index.html
2 http://www.nsf.gov/awardsearch/
3 http://www.gpoaccess.gov/databases.html
4 http://www.ncbi.nih.gov/Genbank/
30Solution: Distributed Information Retrieval or Federated Search
- The alternative to a Single-Database model is a Multi-Database model
DIR Engine
Retrieval Engine 1
Retrieval Engine 2
Retrieval Engine 3
Retrieval Engine n
31Federated Search Cont.
Cooperative and Uncooperative
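The multi-database picture can be sketched as a broker (the DIR engine) that forwards a query to each retrieval engine and merges the ranked lists. The merging rule used here, min-max score normalization per engine, is only one simple assumption among many possible strategies, and the two engines are hypothetical stubs.

```python
def normalize(results):
    """Min-max normalize one engine's (doc, score) list to the range [0, 1]."""
    if not results:
        return []
    scores = [score for _, score in results]
    low, high = min(scores), max(scores)
    span = high - low
    return [(doc, (score - low) / span if span else 1.0) for doc, score in results]


def federated_search(query, engines, k=5):
    """Send the query to every engine and merge the normalized results."""
    merged = {}
    for engine in engines:
        for doc, score in normalize(engine(query)):
            # Keep the best normalized score if a document comes back twice.
            merged[doc] = max(score, merged.get(doc, 0.0))
    ranked = sorted(merged.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


# Hypothetical engines with incomparable raw score scales.
def engine_1(query):
    return [("a1", 12.0), ("a2", 8.0), ("a3", 4.0)]


def engine_2(query):
    return [("b1", 0.9), ("b2", 0.3)]


top = federated_search("lemur", [engine_1, engine_2], k=3)
print(top)
```

Normalization is needed because raw scores from different engines are not directly comparable; in an uncooperative setting the broker typically has little more than these result lists to work with.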
33Open Source Search Engines
- Lemur: CIIR and LTI Lab
- Indri: CIIR and LTI Lab
- Lucene
- Terrier: University of Glasgow
- Xapian
34Open Source Search Engines- Comparison
35Open Source Search Engines- Comparison
36Open Source Search Engines- Comparison
37Open Source Search Engines- Lemur
38Slides From
- Lemur Toolkit Tutorial (Paul Ogilvie, Trevor Strohman)
- INDRI Overview (Don Metzler)
39Zoology 101
- Lemurs are primates found only in Madagascar
- 50 species (17 are endangered)
- Ring-tailed lemurs
- Lemur catta
40Zoology 101
- The indri is the largest type of lemur
- When first spotted, the natives yelled "Indri! Indri!"
- Malagasy for "Look! Over there!"
41Installation
- Linux, OS/X
- Extract software/lemur-4.3.2.tar.gz
- ./configure --prefix=/install/path
- make
- make install
- Windows
- Run software/lemur-4.3.2-install.exe
- Documentation in windoc/index.html
42Steps
- Building an index
- Running queries
- Evaluating results
43Indexing
- Document Preparation
- Indexing Parameters
- Time and Space Requirements
44Indexing Document Preparation
Document formats: the Lemur Toolkit can inherently deal with several different document format types without any modification
- TREC Text
- TREC Web
- Plain Text
- Microsoft Word (*)
- Microsoft PowerPoint (*)
(*) Note: Microsoft Word and Microsoft PowerPoint can only be indexed on a Windows-based machine, and Office must be installed.
45Indexing Document Preparation
- If your documents are not in a format that the Lemur Toolkit can inherently process
- If necessary, extract the text from the document.
- Wrap the plaintext in TREC-style wrappers
- <DOC>
- <DOCNO>document_id</DOCNO>
- <TEXT>
- Index this document text.
- </TEXT>
- </DOC>
- or
- For more advanced users, write your own parser to extend the Lemur Toolkit.
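The wrapping step can be scripted. A minimal sketch, assuming the plain text has already been extracted; the function names and the sample documents are hypothetical, and no escaping is applied to the text.

```python
def to_trec_text(doc_id, text):
    """Wrap one plain-text document in TREC-style DOC/DOCNO/TEXT tags."""
    return (
        "<DOC>\n"
        f"<DOCNO>{doc_id}</DOCNO>\n"
        "<TEXT>\n"
        f"{text.strip()}\n"
        "</TEXT>\n"
        "</DOC>\n"
    )


def to_trec_collection(documents):
    """Concatenate wrapped documents so one file can hold many of them."""
    return "".join(to_trec_text(doc_id, text) for doc_id, text in documents)


# Hypothetical documents as (document_id, text) pairs.
documents = [
    ("doc-1", "Index this document text."),
    ("doc-2", "Another short document."),
]
collection = to_trec_collection(documents)
print(collection)
```

The resulting text can be written to a file and indexed with the trectext document class.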
46Indexing - Parameters
- Basic usage to build index
- IndriBuildIndex <parameter_file>
- Parameter file includes options for
- Where to find your data files
- Where to place the index
- How much memory to use
- Stopword, stemming, fields
- Many other parameters.
47Indexing - Parameters
- Standard parameter file specification: an XML document
- <parameters>
- <option></option>
- <option></option>
- ...
- <option></option>
- </parameters>
48Indexing - Parameters
- <corpus> - where to find your source files and what type to expect
- <path> (required): the path to the source files (absolute or relative)
- <class> (optional): the document type to expect. If omitted, IndriBuildIndex will attempt to guess at the filetype based on the file's extension.
- <parameters>
- <corpus>
- <path>/path/to/source/files</path>
- <class>trectext</class>
- </corpus>
- </parameters>
49Indexing - Parameters
- The <index> parameter tells IndriBuildIndex where to create or incrementally add to the index
- If the index does not exist, it will create a new one
- If the index already exists, it will append new documents into the index.
- <parameters>
- <index>/path/to/the/index</index>
- </parameters>
50Indexing - Parameters
- <memory> - used to define a soft limit on the amount of memory the indexer should use before flushing its buffers to disk.
- Use K for kilobytes, M for megabytes, and G for gigabytes.
- <parameters>
- <memory>256M</memory>
- </parameters>
51Indexing - Parameters
- Stopwords can be defined within a <stopper> block, with individual stopwords enclosed in <word> tags.
- <parameters>
- <stopper>
- <word>first_word</word>
- <word>next_word</word>
- ...
- <word>final_word</word>
- </stopper>
- </parameters>
When using the Web class, pay attention to <!...> and <!-- .. --> tags
52Indexing - Parameters
- Term stemming can be used while indexing as well via the <stemmer> tag.
- Specify the stemmer type via the <name> tag within.
- Stemmers included with the Lemur Toolkit include the Krovetz Stemmer and the Porter Stemmer.
- <parameters>
- <stemmer>
- <name>krovetz</name>
- </stemmer>
- </parameters>
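The indexing options from the preceding slides (corpus, index, memory, stopper, stemmer) can be collected into one parameter file. The sketch below generates such a file with Python's standard library; the helper function and its argument names are hypothetical, and only the tags shown on the slides are emitted.

```python
import xml.etree.ElementTree as ET


def build_index_parameters(corpus_path, index_path, corpus_class=None,
                           memory=None, stopwords=(), stemmer=None):
    """Return an IndriBuildIndex-style parameter file as an XML string."""
    root = ET.Element("parameters")
    corpus = ET.SubElement(root, "corpus")
    ET.SubElement(corpus, "path").text = corpus_path
    if corpus_class:
        ET.SubElement(corpus, "class").text = corpus_class
    ET.SubElement(root, "index").text = index_path
    if memory:
        ET.SubElement(root, "memory").text = memory
    if stopwords:
        stopper = ET.SubElement(root, "stopper")
        for word in stopwords:
            ET.SubElement(stopper, "word").text = word
    if stemmer:
        ET.SubElement(ET.SubElement(root, "stemmer"), "name").text = stemmer
    return ET.tostring(root, encoding="unicode")


xml_text = build_index_parameters(
    "/path/to/source/files",
    "/path/to/the/index",
    corpus_class="trectext",
    memory="256M",
    stopwords=["the", "of"],  # illustrative stopwords
    stemmer="krovetz",
)
print(xml_text)
```

The printed text can be saved to a file and passed to IndriBuildIndex as its parameter file.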
53Indexing anchor text
- Run the harvestlinks application on your data before indexing
- Pass <inlink>path-to-links</inlink> as a parameter to IndriBuildIndex to index the anchor text
54Retrieval
- Parameters
- Query Formatting
- Interpreting Results
55Retrieval - Parameters
- Basic usage for retrieval
- IndriRunQuery/RetEval <parameter_file>
- Parameter file includes options for
- Where to find the index
- The query or queries
- How much memory to use
- Formatting options
- Many other parameters.
56Retrieval - Parameters
- Just as with indexing
- A well-formed XML document with options, wrapped by <parameters> tags
- <parameters>
- <options></options>
- <options></options>
- ...
- <options></options>
- </parameters>
57Retrieval - Parameters
- The <index> parameter tells IndriRunQuery/RetEval where to find the repository.
- <parameters>
- <index>/path/to/the/index</index>
- </parameters>
58Retrieval - Parameters
- The <query> parameter specifies a query
- plain text or using the Indri query language
- <parameters>
- <query>
- <number>1</number>
- <text>this is the first query</text>
- </query>
- <query>
- <number>2</number>
- <text>another query to run</text>
- </query>
- </parameters>
59Retrieval - Parameters
- TREC-style topics cannot be processed directly by IndriRunQuery/RetEval.
- Format the queries accordingly
- Format by hand
- Write a script to extract the fields
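Such a script might look like the sketch below. It assumes the common TREC topic layout with top, num and title tags, uses only the title field as the query, and strips punctuation because characters such as "#" and parentheses have meaning in the Indri query language. The regular expressions and the sample topics are illustrative only.

```python
import re
import xml.etree.ElementTree as ET

TOPIC_RE = re.compile(r"<top>(.*?)</top>", re.DOTALL)
NUM_RE = re.compile(r"<num>\s*(?:Number:)?\s*(\d+)")
TITLE_RE = re.compile(r"<title>\s*(?:Topic:)?\s*(.*?)\s*(?=<|\Z)", re.DOTALL)


def topics_to_parameters(topics_text):
    """Turn TREC-style topics into query blocks inside a parameters element."""
    root = ET.Element("parameters")
    for block in TOPIC_RE.findall(topics_text):
        num = NUM_RE.search(block)
        title = TITLE_RE.search(block)
        if not (num and title):
            continue  # skip topics that lack a number or a title
        query = ET.SubElement(root, "query")
        ET.SubElement(query, "number").text = num.group(1)
        # Keep word characters only, so punctuation never reaches the query parser.
        ET.SubElement(query, "text").text = " ".join(re.findall(r"\w+", title.group(1)))
    return ET.tostring(root, encoding="unicode")


topics = (
    "<top>\n<num> Number: 301\n<title> International Organized Crime\n\n"
    "<desc> Description:\nIdentify organizations.\n</top>\n"
    "<top>\n<num> Number: 302\n<title> Poliomyelitis and Post-Polio\n</top>\n"
)
query_xml = topics_to_parameters(topics)
print(query_xml)
```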
60Retrieval - Parameters
- As with indexing, the <memory> parameter can be used to define a soft limit on the amount of memory the retrieval system uses.
- Use K for kilobytes, M for megabytes, and G for gigabytes.
- <parameters>
- <memory>256M</memory>
- </parameters>
61Retrieval - Parameters
- As with indexing, stopwords can be defined within a <stopper> block, with individual stopwords enclosed in <word> tags.
- <parameters>
- <stopper>
- <word>first_word</word>
- <word>next_word</word>
- ...
- <word>final_word</word>
- </stopper>
- </parameters>
62Retrieval - Parameters
- To specify a maximum number of results to return, use the <count> tag
- <parameters>
- <count>50</count>
- </parameters>
63Retrieval - Parameters
- Result formatting options
- IndriRunQuery/RetEval has built in formatting
specifications for TREC and INEX retrieval tasks
64Retrieval - Parameters
- TREC Formatting directives
- <runID>: a string specifying the id for a query run, used in TREC scorable output.
- <trecFormat>: true to produce TREC scorable output, otherwise use false (default).
- <parameters>
- <runID>runName</runID>
- <trecFormat>true</trecFormat>
- </parameters>
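The retrieval options covered so far (index, query, count, runID, trecFormat) can be combined into a single parameter file. A sketch using Python's standard library; the helper function and its argument names are hypothetical, and only tags shown on the slides are written.

```python
import xml.etree.ElementTree as ET


def build_query_parameters(index_path, queries, count=None, run_id=None,
                           trec_format=False):
    """Return an IndriRunQuery-style parameter file as an XML string."""
    root = ET.Element("parameters")
    ET.SubElement(root, "index").text = index_path
    for number, text in queries:
        query = ET.SubElement(root, "query")
        ET.SubElement(query, "number").text = str(number)
        ET.SubElement(query, "text").text = text
    if count is not None:
        ET.SubElement(root, "count").text = str(count)
    if run_id:
        ET.SubElement(root, "runID").text = run_id
    if trec_format:
        ET.SubElement(root, "trecFormat").text = "true"
    return ET.tostring(root, encoding="unicode")


run_xml = build_query_parameters(
    "/path/to/the/index",
    [(1, "this is the first query"), (2, "another query to run")],
    count=50,
    run_id="runName",
    trec_format=True,
)
print(run_xml)
```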
65Retrieval - Interpreting Results
- The default output from IndriRunQuery will return a list of results, 1 result per line, with 4 columns
- <score>: the score of the returned document. An Indri query will always return a negative value for a result.
- <docID>: the document ID
- <extent_begin>: the starting token number of the extent that was retrieved
- <extent_end>: the ending token number of the extent that was retrieved
66Retrieval - Interpreting Results
- When executing IndriRunQuery with the default formatting options, the output will look something like
- <score> <DocID> <extent_begin> <extent_end>
- -4.83646 AP890101-0001 0 485
- -7.06236 AP890101-0015 0 385
- ...
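The four-column default output is easy to read back into a program. A small sketch, assuming whitespace-separated columns exactly as in the sample lines above; the Result type and function name are hypothetical.

```python
from typing import NamedTuple


class Result(NamedTuple):
    score: float
    doc_id: str
    extent_begin: int
    extent_end: int


def parse_default_output(text):
    """Parse lines of the form: score docID extent_begin extent_end."""
    results = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 4:
            continue  # ignore blank or unexpected lines
        score, doc_id, begin, end = parts
        results.append(Result(float(score), doc_id, int(begin), int(end)))
    return results


sample = "-4.83646 AP890101-0001 0 485\n-7.06236 AP890101-0015 0 385\n"
results = parse_default_output(sample)
print(results[0].doc_id, results[0].score)
```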
67Retrieval - Evaluation
- To use trec_eval
- Format IndriRunQuery results with the appropriate trec_eval formatting directives in the parameter file
- <runID>runName</runID>
- <trecFormat>true</trecFormat>
- Resulting output will be in standard TREC format ready for evaluation
- <queryID> Q0 <DocID> <rank> <score> <runID>
- 150 Q0 AP890101-0001 1 -4.83646 runName
- 150 Q0 AP890101-0015 2 -7.06236 runName
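trec_eval is the usual tool for scoring such a run, but the six-column format is simple enough to inspect directly. The sketch below groups a run by query and computes precision at k; the relevance judgment used in the example is made up.

```python
from collections import defaultdict


def parse_trec_run(text):
    """Map each query ID to its documents in rank order."""
    rows_by_query = defaultdict(list)
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 6:
            continue  # expected: queryID Q0 docID rank score runID
        query_id, _q0, doc_id, rank, score, _run_id = parts
        rows_by_query[query_id].append((int(rank), doc_id, float(score)))
    return {
        query_id: [doc_id for _, doc_id, _ in sorted(rows)]
        for query_id, rows in rows_by_query.items()
    }


def precision_at_k(ranked_docs, relevant, k):
    """Fraction of the top k documents that are judged relevant."""
    return sum(1 for doc_id in ranked_docs[:k] if doc_id in relevant) / k


run_text = (
    "150 Q0 AP890101-0001 1 -4.83646 runName\n"
    "150 Q0 AP890101-0015 2 -7.06236 runName\n"
)
run = parse_trec_run(run_text)
hypothetical_relevant = {"AP890101-0015"}  # made-up judgment for query 150
print(precision_at_k(run["150"], hypothetical_relevant, k=2))
```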
68Indri IndexEnvironment
- Most important methods
- addFile: adds a file of text to the index
- addString: adds a document (in a text string) to the index
- addParsedDocument: adds a ParsedDocument structure to the index
- setIndexedFields: tells the indexer which fields to store in the index
69Indri QueryEnvironment
- The core of the Indri API
- Includes methods for
- Opening indexes and connecting to query servers
- Running queries
- Collecting collection statistics
- Retrieving document text
- Can be used from C++, Java, PHP or C#
70QueryEnvironment Opening
- Opening methods
- addIndex: opens an index from the local disk
- Indri treats all open indexes as a single
collection
71QueryEnvironment Running
- Running queries
- runQuery: runs an Indri query, returns a ranked list of results (can add a document set in order to restrict evaluation to a few documents)
- runAnnotatedQuery: returns a ranked list of results and a list of all document locations where the query matched something
72QueryEnvironment Retrieving
- Retrieving document text
- documents: returns the full text of a set of documents
- documentMetadata: returns portions of the document (e.g. just document titles)
- documentsFromMetadata: returns documents that contain a certain bit of metadata (e.g. a URL)
- expressionList: an inverted list for a particular Indri query language expression
73Lemur Retrieval
74Lemur Other tasks
- Clustering: ClusterDB
- Distributed IR: DistMergeMethod
- Language models: UnigramLM, DirichletUnigramLM, etc.
75Getting Help
- http://www.lemurproject.org
- Central website, tutorials, documentation, news
- http://www.lemurproject.org/phorum
- Discussion board, developers read and respond to questions
- http://ciir.cs.umass.edu/strohman/indri
- My own page of Indri tips
- README file in the code distribution
76Indri in Action
Indexing
Search1
Search2
77Questions? Thanks For Your Attention
78Indri Query Language
- terms
- field restriction / evaluation
- numeric
- combining beliefs
- field / passage retrieval
- filters
- document priors
- http://www.lemurproject.org/lemur/IndriQueryLanguage.html
79Term Operations
80Field Restriction/Evaluation
81Numeric Operators
82Belief Operations
83Field/Passage Retrieval
84More Field/Passage Retrieval
- .//field for ancestor
- .\field for parent
85Filter Operations
86Document Priors
- RECENT prior built using makeprior application