Search Engines presentation

Title: Search Engines

1
Search Engines

Hadi Amiri
Database Research Group
ECE Department, University of Tehran
Email: h.amiri_at_ece.ut.ac.ir
Web: khorshid.ut.ac.ir/h.amiri

2
Outline

Part 1. Web
Information Retrieval
Evaluation Method
Web Characteristics
Search Engines
Search Engine Query Logs
Lacks and Federated Search
Part 2. Open Source Search Engines
Lemur and Indri
Lemur Analysis

4
Information Retrieval Systems

IR Deals with

Representation (LV),
Storage,
Organization and
Access
to Information Items
[Diagram label: Retrieval Engine]
5
Inside The IR Black Box
[Diagram: Documents and the Query each pass through a Representation Function, giving a Document Representation (stored in the Index) and a Query Representation. A Comparison Function matches the two and returns Hits.]
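
The black box can be sketched in a few lines of Python. This is only an illustrative toy, not how any particular engine works: the names represent, build_index and compare are made up for this sketch, the representation function is plain lowercasing and whitespace splitting, and the comparison function simply counts shared terms.

```python
from collections import defaultdict

def represent(text):
    """Representation function: lowercase the text and split it into terms."""
    return text.lower().split()

def build_index(documents):
    """Index: map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in represent(text):
            index[term].add(doc_id)
    return index

def compare(query, index):
    """Comparison function: rank documents by the number of query terms they contain."""
    scores = defaultdict(int)
    for term in set(represent(query)):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties broken by document ID for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

documents = {
    "d1": "search engines index the web",
    "d2": "lemurs live in Madagascar",
    "d3": "open source search engines",
}
index = build_index(documents)
hits = compare("search engines web", index)
```

Here hits lists d1 (three matching terms) ahead of d3 (two); d2 shares no term with the query and is not returned.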
7
Text Operations
8
The Retrieval Process in Detail
9
The Central Problem in IR
[Diagram: Authors turn Concepts into Document Terms; the Information Seeker turns Concepts into Query Terms.]
Do these represent the same concepts?
10
IR Evaluation

Three components of a test collection:
Information Repository: collection of documents
Queries: set of information needs
Relevance Judgments: sets of documents that satisfy the information needs

11
IR Evaluation Cont.
Query q
And some other performance measures derived from
them
12
IR Evaluation Cont.
Example
No. of Relevant Documents is 5
13
IR Evaluation Cont.
Interpolation
Using TREC_EVAL
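
The slide's formulas did not survive in this transcript, so the following is a sketch under an assumption: that "interpolation" means the usual interpolated precision at 11 standard recall levels, where the value at recall level r is the highest precision observed at any recall of r or more. The ranking and the relevant set are invented; only the count of 5 relevant documents follows the earlier example.

```python
def precision_recall_points(ranking, relevant):
    """Return (recall, precision) measured at each relevant document in the ranking."""
    points = []
    found = 0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            found += 1
            points.append((found / len(relevant), found / rank))
    return points

def interpolate(points, levels=11):
    """Interpolated precision at level r: max precision at any recall >= r."""
    result = []
    for i in range(levels):
        level = i / (levels - 1)
        # Small tolerance guards against floating-point rounding at the level boundaries.
        candidates = [p for r, p in points if r >= level - 1e-9]
        result.append(max(candidates, default=0.0))
    return result

relevant = {"d1", "d3", "d5", "d8", "d9"}  # 5 relevant documents (hypothetical IDs)
ranking = ["d1", "d2", "d3", "d4", "d6", "d5", "d7", "d10"]
points = precision_recall_points(ranking, relevant)
interpolated = interpolate(points)
```

Only three of the five relevant documents are retrieved, so recall never exceeds 0.6 and the interpolated precision is 0 at recall levels 0.7 through 1.0.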
14
Web Characteristics

Distributed data
Visible and Invisible (Hidden) Web
Volatile data: 40% changes per month
Very large volume
Very large answers
1998: 3,000,000 servers, 350,000,000 pages
2003: Google alone indexed 3,307,998,701 pages (10 times
more)
Unstructured and redundant data: 30% are
duplicates
Quality of data
Heterogeneity
Data (languages, alphabets, e.g. Chinese)
Users (inexperienced)

15
Search Engines
And Their Services?
16
Search Engines
[Diagram labels: Web, Search Engine, Indexed Documents (D), Index, Public Interface, Target function]
21
Newest In Google
23
Search Engine Query Logs

Query Logs Contain
Query
Duration
IP
Clicks
Date
...

[Figure: Query Log]
24
Search Engine Query Logs

Trend Detection
Social Analysis
Control Trends
Term Analysis
CTD Decrement

25
Sponsored Search Results
26
Term Distribution
Frequency in query-logs
Queries
A good idea is to find inexpensive or moderately
priced keywords that are highly similar to the
expensive keywords. Methods such as semantic
clustering and text mining are applicable.
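
A term distribution of the kind named on this slide can be obtained by counting how often each term occurs across the logged queries. The sketch below assumes the log has already been reduced to a list of query strings; the sample queries and the function name term_frequencies are invented for illustration.

```python
from collections import Counter

def term_frequencies(queries):
    """Count how often each term appears across all logged queries."""
    counts = Counter()
    for query in queries:
        counts.update(query.lower().split())
    return counts

# Hypothetical query log, one query string per entry.
query_log = [
    "cheap flights tehran",
    "flights to london",
    "cheap hotels london",
    "open source search engine",
]
frequencies = term_frequencies(query_log)
top_terms = frequencies.most_common(3)
```

Sorting the counts in descending order gives the frequency curve; the long tail of rare terms is where the cheaper keywords are likely to be found.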
27
Lacks In Search Engines

Need to Locally Store Data (Documents)
Distributed Data
Visible and Invisible (Hidden) Web

Hidden Web: information is hidden from
conventional engines
No arbitrary crawl of the data (e.g., ACM
library)
Updated too frequently to be crawled (e.g.,
buy.com)
Larger than the Visible Web: 2-50 times (Sherman,
2001), 500 times (Bergman, 2001)
Created by professionals

28
Lacks and Shortages Cont.
Different Types of Hidden Information Sources

Allow access to their contents via the
source-specific search interface
Allow their contents to be copied by conventional
search engines
Access to their contents is subject to a fee or
subscription

29
Lacks and Shortages Cont.
Some Examples

US Patent and Trademark Office (USPTO) Database
National Science Foundation's Award Database
U.S. Government Printing Office (GPO) Portal
National Institutes of Health's GenBank

1 http://www.uspto.gov/patft/index.html
2 http://www.nsf.gov/awardsearch/
3 http://www.gpoaccess.gov/databases.html
4 http://www.ncbi.nih.gov/Genbank/
30
Solution: Distributed Information Retrieval or
Federated Search

The alternative to a Single-Database model is a
Multi-Database model

DIR Engine
Retrieval Engine 1
Retrieval Engine 2
Retrieval Engine 3
Retrieval Engine n
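
The multi-database model can be sketched as a broker (the DIR engine) that forwards a query to every retrieval engine and merges the returned lists. Everything below is hypothetical: the engines are in-memory toys, and merging by raw score is the simplest possible choice. Scores from independent engines are generally not comparable, so a practical system would need to normalize them before merging.

```python
def make_engine(collection):
    """Build a toy retrieval engine over an in-memory collection (illustration only)."""
    def search(query):
        terms = set(query.lower().split())
        hits = []
        for doc_id, text in collection.items():
            overlap = len(terms & set(text.lower().split()))
            if overlap:
                # Score is the fraction of query terms found in the document.
                hits.append((doc_id, overlap / len(terms)))
        return hits
    return search

def federated_search(query, engines, k=3):
    """DIR engine: send the query to every engine and merge results by score."""
    merged = []
    for name, search in engines.items():
        for doc_id, score in search(query):
            merged.append((score, name, doc_id))
    merged.sort(key=lambda hit: -hit[0])  # stable sort keeps engine order on ties
    return merged[:k]

engines = {
    "engine1": make_engine({"a1": "patent search database", "a2": "award database"}),
    "engine2": make_engine({"b1": "gene sequence database", "b2": "patent law"}),
}
results = federated_search("patent database", engines)
```

Each merged result keeps the name of the engine it came from, so the user interface can link back to the originating source.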

31
Federated Search Cont.
Cooperative and Uncooperative

Resource Description

32
Outline

Part 1. Web
Information Retrieval
Evaluation Method
Web Characteristics
Search Engines
Search Engine Query Logs
Lakes and Federated Search
Part 2. Open Source Search Engines
Lemur and Indri
Lemur Analysis

33
Open Source Search Engines

Lemur: CIIR and LTI Lab
Indri: CIIR and LTI Lab
Lucene
Terrier: University of Glasgow
Xapian

34
Open Source Search Engines- Comparison
35
Open Source Search Engines- Comparison
36
Open Source Search Engines- Comparison
37
Open Source Search Engines- Lemur
38
Slides From

Lemur Toolkit Tutorial
Paul Ogilvie
Trevor Strohman

INDRI Overview
Don Metzler

And
39
Zoology 101

Lemurs are primates found only in Madagascar
50 species (17 are endangered)
Ring-tailed lemurs: Lemur catta

40
Zoology 101

The indri is the largest type of lemur
When first spotted, the natives yelled "Indri! Indri!"
Malagasy for "Look! Over there!"

41
Installation

Linux, OS/X
Extract software/lemur-4.3.2.tar.gz
./configure --prefix=/install/path
make
make install
Windows
Run software/lemur-4.3.2-install.exe
Documentation in windoc/index.html

42
Steps

Building an index
Running queries
Evaluating results

43
Indexing

Document Preparation
Indexing Parameters
Time and Space Requirements

44
Indexing Document Preparation
Document Formats: The Lemur Toolkit can natively
handle several document formats without any
modification

TREC Text
TREC Web
Plain Text
Microsoft Word (*)
Microsoft PowerPoint (*)
HTML
XML
PDF
Mbox

(*) Note: Microsoft Word and Microsoft PowerPoint
can only be indexed on a Windows-based machine,
and Office must be installed.
45
Indexing Document Preparation

If your documents are not in a format that the
Lemur Toolkit can natively process:
If necessary, extract the text from the document.
Wrap the plain text in TREC-style wrappers:
<DOC>
<DOCNO>document_id</DOCNO>
<TEXT>
Index this document text.
</TEXT>
</DOC>
or
For more advanced users, write your own parser
to extend the Lemur Toolkit.
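
The wrapping step is easy to script. The sketch below is a minimal example, assuming the text has already been extracted; the function name wrap_trec and the document IDs are invented, and the sketch does not escape or strip any markup that might appear inside the text.

```python
def wrap_trec(doc_id, text):
    """Wrap plain text in TREC-style DOC / DOCNO / TEXT tags."""
    return (
        "<DOC>\n"
        f"<DOCNO>{doc_id}</DOCNO>\n"
        "<TEXT>\n"
        f"{text.strip()}\n"
        "</TEXT>\n"
        "</DOC>\n"
    )

# Hypothetical documents keyed by ID; in practice these would be read from files.
documents = {
    "doc-001": "Index this document text.",
    "doc-002": "  Another plain-text document.  ",
}
trec_file = "".join(wrap_trec(doc_id, text) for doc_id, text in documents.items())
```

Writing trec_file to disk gives a single file holding several documents, which can then be indexed with the trectext class.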

46
Indexing - Parameters

Basic usage to build an index:
IndriBuildIndex <parameter_file>
Parameter file includes options for
Where to find your data files
Where to place the index
How much memory to use
Stopwords, stemming, fields
Many other parameters.

47
Indexing - Parameters

Standard parameter file specification: an XML
document
<parameters>
  <option></option>
  <option></option>
  <option></option>
</parameters>

48
Indexing - Parameters

<corpus> - where to find your source files and
what type to expect
<path> (required): the path to the source files
(absolute or relative)
<class> (optional): the document type to expect.
If omitted, IndriBuildIndex will attempt to guess
the file type based on the file's extension.
<parameters>
  <corpus>
    <path>/path/to/source/files</path>
    <class>trectext</class>
  </corpus>
</parameters>

49
Indexing - Parameters

The <index> parameter tells IndriBuildIndex where
to create the index or incrementally add to it
If the index does not exist, it will create a new one
If the index already exists, it will append new
documents to the index.
<parameters>
  <index>/path/to/the/index</index>
</parameters>

50
Indexing - Parameters

<memory> - used to define a soft limit on the
amount of memory the indexer should use before
flushing its buffers to disk.
Use K for kilobytes, M for megabytes, and G for
gigabytes.
<parameters>
  <memory>256M</memory>
</parameters>

51
Indexing - Parameters

Stopwords can be defined within a <stopper> block,
with each individual stopword enclosed in
<word> tags.
<parameters>
  <stopper>
    <word>first_word</word>
    <word>next_word</word>
    <word>final_word</word>
  </stopper>
</parameters>

When using Web-class files, pay attention to
<!...> and <!-- .. --> tags
52
Indexing - Parameters

Term stemming can be used while indexing as well,
via the <stemmer> tag.
Specify the stemmer type via the <name> tag
within.
Stemmers included with the Lemur Toolkit include
the Krovetz Stemmer and the Porter Stemmer.
<parameters>
  <stemmer>
    <name>krovetz</name>
  </stemmer>
</parameters>
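
The options from the preceding slides (corpus, index, memory, stopper, stemmer) can sit together in one parameter file. As a hedged sketch, the Python below assembles such a file with the standard library; the function name build_index_parameters is invented, the element names are the ones shown on the slides, and the values are the slides' placeholders.

```python
import xml.etree.ElementTree as ET

def build_index_parameters(corpus_path, corpus_class, index_path,
                           memory, stopwords, stemmer):
    """Assemble an indexing parameter file from the options covered on the slides."""
    root = ET.Element("parameters")
    corpus = ET.SubElement(root, "corpus")
    ET.SubElement(corpus, "path").text = corpus_path
    ET.SubElement(corpus, "class").text = corpus_class
    ET.SubElement(root, "index").text = index_path
    ET.SubElement(root, "memory").text = memory
    stopper = ET.SubElement(root, "stopper")
    for word in stopwords:
        ET.SubElement(stopper, "word").text = word
    ET.SubElement(ET.SubElement(root, "stemmer"), "name").text = stemmer
    return ET.tostring(root, encoding="unicode")

xml_text = build_index_parameters(
    "/path/to/source/files", "trectext", "/path/to/the/index",
    "256M", ["first_word", "next_word", "final_word"], "krovetz",
)
```

Saving xml_text to a file gives something that could be passed to IndriBuildIndex as its parameter file argument.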

53
Indexing anchor text

Run the harvestlinks application on your data before
indexing
Pass <inlink>path-to-links</inlink> as a parameter to
IndriBuildIndex to index the anchor text

54
Retrieval

Parameters
Query Formatting
Interpreting Results

55
Retrieval - Parameters

Basic usage for retrieval:
IndriRunQuery/RetEval <parameter_file>
Parameter file includes options for
Where to find the index
The query or queries
How much memory to use
Formatting options
Many other parameters.

56
Retrieval - Parameters

Just as with indexing:
A well-formed XML document with options, wrapped
by <parameters> tags
<parameters>
  <options></options>
  <options></options>
  <options></options>
</parameters>

57
Retrieval - Parameters

The <index> parameter tells IndriRunQuery/RetEval
where to find the repository.
<parameters>
  <index>/path/to/the/index</index>
</parameters>

58
Retrieval - Parameters

The <query> parameter specifies a query,
in plain text or using the Indri query language
<parameters>
  <query>
    <number>1</number>
    <text>this is the first query</text>
  </query>
  <query>
    <number>2</number>
    <text>another query to run</text>
  </query>
</parameters>

59
Retrieval - Parameters

TREC-style topics cannot be processed directly
by IndriRunQuery/RetEval.
Format the queries accordingly:
Format by hand
Write a script to extract the fields
60
Retrieval - Parameters

As with indexing, the <memory> parameter can be
used to define a soft limit on the amount of
memory the retrieval system uses.
Use K for kilobytes, M for megabytes, and G for
gigabytes.
<parameters>
  <memory>256M</memory>
</parameters>

61
Retrieval - Parameters

As with indexing, stopwords can be defined within
a <stopper> block, with each individual stopword
enclosed in <word> tags.
<parameters>
  <stopper>
    <word>first_word</word>
    <word>next_word</word>
    <word>final_word</word>
  </stopper>
</parameters>

62
Retrieval - Parameters

To specify a maximum number of results to return,
use the <count> tag
<parameters>
  <count>50</count>
</parameters>

63
Retrieval - Parameters

Result formatting options
IndriRunQuery/RetEval has built-in formatting
specifications for TREC and INEX retrieval tasks

64
Retrieval - Parameters

TREC formatting directives
<runID>: a string specifying the ID for a query
run, used in TREC scorable output.
<trecFormat>: true to produce TREC scorable
output, otherwise use false (default).
<parameters>
  <runID>runName</runID>
  <trecFormat>true</trecFormat>
</parameters>

65
Retrieval - Interpreting Results

By default, IndriRunQuery returns
a list of results, 1 result per line, with 4
columns:
<score>: the score of the returned document. An
Indri query will always return a negative value
for a result.
<docID>: the document ID
<extent_begin>: the starting token number of the
extent that was retrieved
<extent_end>: the ending token number of the
extent that was retrieved

66
Retrieval - Interpreting Results

When executing IndriRunQuery with the default
formatting options, the output will look
something like
<score> <DocID> <extent_begin> <extent_end>
-4.83646 AP890101-0001 0 485
-7.06236 AP890101-0015 0 385

67
Retrieval - Evaluation

To use trec_eval:
Format IndriRunQuery results with the appropriate
trec_eval formatting directives in the parameter
file
<runID>runName</runID>
<trecFormat>true</trecFormat>
Resulting output will be in standard TREC format,
ready for evaluation
<queryID> Q0 <DocID> <rank> <score> <runID>
150 Q0 AP890101-0001 1 -4.83646 runName
150 Q0 AP890101-0015 2 -7.06236 runName
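
For quick checks before running trec_eval, the six-column TREC output can be read back with a few lines of Python. This is a minimal sketch that assumes whitespace-separated columns in the order shown above; the function name parse_trec_run is made up for the example.

```python
from collections import defaultdict

def parse_trec_run(lines):
    """Group TREC-format result lines by query ID, ordered by rank."""
    run = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        query_id, _q0, doc_id, rank, score, run_id = parts
        run[query_id].append((int(rank), doc_id, float(score), run_id))
    for results in run.values():
        results.sort()  # tuples sort by rank first
    return dict(run)

# The two sample lines from the slide.
output = [
    "150 Q0 AP890101-0001 1 -4.83646 runName",
    "150 Q0 AP890101-0015 2 -7.06236 runName",
]
run = parse_trec_run(output)
```

The dictionary maps each query ID to its ranked results, which makes it easy to spot empty or truncated result lists per query.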

68
Indri IndexEnvironment

Most important methods
addFile: adds a file of text to the index
addString: adds a document (in a text string) to
the index
addParsedDocument: adds a ParsedDocument
structure to the index
setIndexedFields: tells the indexer which fields
to store in the index

69
Indri QueryEnvironment

The core of the Indri API
Includes methods for
Opening indexes and connecting to query servers
Running queries
Collecting collection statistics
Retrieving document text
Can be used from C++, Java, PHP or C#

70
QueryEnvironment: Opening

Opening methods
addIndex: opens an index from the local disk
Indri treats all open indexes as a single
collection

71
QueryEnvironment: Running

Running queries
runQuery: runs an Indri query, returns a ranked
list of results (can add a document set in order
to restrict evaluation to a few documents)
runAnnotatedQuery: returns a ranked list of
results and a list of all document locations
where the query matched something

72
QueryEnvironment: Retrieving

Retrieving document text
documents: returns the full text of a set of
documents
documentMetadata: returns portions of the
document (e.g. just document titles)
documentsFromMetadata: returns documents that
contain a certain bit of metadata (e.g. a URL)
expressionList: an inverted list for a particular
Indri query language expression

73
Lemur Retrieval
74
Lemur Other tasks

Clustering: ClusterDB
Distributed IR: DistMergeMethod
Language models: UnigramLM, DirichletUnigramLM,
etc.

75
Getting Help

http://www.lemurproject.org
Central website, tutorials, documentation, news
http://www.lemurproject.org/phorum
Discussion board, developers read and respond to
questions
http://ciir.cs.umass.edu/strohman/indri
My own page of Indri tips
README file in the code distribution

76
Indri in Action
Indexing
Search1
Search2
77
Questions? Thanks For Your Attention
78
Indri Query Language

terms
field restriction / evaluation
numeric
combining beliefs
field / passage retrieval
filters
document priors
http://www.lemurproject.org/lemur/IndriQueryLanguage.html

79
Term Operations
80
Field Restriction/Evaluation
81
Numeric Operators
82
Belief Operations
83
Field/Passage Retrieval
84
More Field/Passage Retrieval

.//field for ancestor
.\field for parent

85
Filter Operations
86
Document Priors

RECENT prior built using makeprior application
