1
Search Engines
  • Hadi Amiri
  • Database Research Group
  • ECE Department, University of Tehran
  • email: h.amiri@ece.ut.ac.ir
  • web: khorshid.ut.ac.ir/h.amiri

2
Outline
  • Part 1. Web
  • Information Retrieval
  • Evaluation Method
  • Web Characteristics
  • Search Engines
  • Search Engine Query Logs
  • Shortcomings and Federated Search
  • Part 2. Open Source Search Engines
  • Lemur and Indri
  • Lemur Analysis

3
Outline
  • Part 1. Web
  • Information Retrieval
  • Evaluation Method
  • Web Characteristics
  • Search Engines
  • Search Engine Query Logs
  • Shortcomings and Federated Search
  • Part 2. Open Source Search Engines
  • Lemur and Indri
  • Lemur Analysis

4
Information Retrieval Systems
  • IR deals with the representation (logical view), storage, organization of, and access to information items.

[Diagram: a retrieval engine providing access to information items]
5
Inside The IR Black Box
[Diagram: documents and the query each pass through a representation function; document representations are stored in an index; a comparison function matches the query representation against the index and returns hits.]
6
Inside The IR Black Box
[Same diagram as the previous slide.]
7
Text Operations
8
The Retrieval Process in Detail
9
The Central Problem in IR
[Diagram: an author's concepts become document terms; an information seeker's concepts become query terms.]
Do these represent the same concepts?
10
IR Evaluation
  • Three components of a test collection:
  • Information Repository: a collection of documents
  • Queries: a set of information needs
  • Relevance Judgments: the sets of documents that satisfy the information needs

11
IR Evaluation Cont.
Given a query q, let Relevant be the set of relevant documents and Retrieved the set the system returns:

Precision = |Relevant ∩ Retrieved| / |Retrieved|
Recall = |Relevant ∩ Retrieved| / |Relevant|

Some other performance measures are derived from these.
12
IR Evaluation Cont.
Example (the result table on the original slide is an image; a stand-in follows)
The number of relevant documents is 5.
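A hypothetical stand-in, not from the original deck: suppose the system returns 10 documents and the relevant ones appear at ranks 1, 3, and 6. Then:

Precision@10 = 3/10 = 0.30
Recall@10 = 3/5 = 0.60
Precision at each relevant hit: 1/1 = 1.00, 2/3 ≈ 0.67, 3/6 = 0.50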
13
IR Evaluation Cont.
Interpolation
Using trec_eval
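The standard interpolation rule (the one trec_eval applies at the 11 standard recall points): interpolated precision at recall level r is the maximum precision observed at any recall level at or above r:

P_interp(r) = max { P(r') : r' >= r }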
14
Web Characteristics
  • Distributed data
  • Visible and invisible (hidden) Web
  • Volatile data: roughly 40% changes per month
  • Very large volume
  • Very large answers
  • 1998: 3,000,000 servers, 350,000,000 pages
  • 2003: Google alone indexed 3,307,998,701 pages (about 10 times more)
  • Unstructured and redundant data: about 30% are duplicates
  • Quality of data
  • Heterogeneity
  • of data (languages, alphabets, e.g., Chinese)
  • of users (inexperienced)

15
Search Engines
And Their Services?
16
Search Engines
[Diagram: a search engine crawls the Web into a set D of indexed documents, builds an index over D, and serves queries through a public interface; the figure also labels a target function.]
17
(No Transcript)
18
(No Transcript)
19
(No Transcript)
20
(No Transcript)
21
Newest In Google
22
(No Transcript)
23
Search Engine Query Logs
  • Query logs contain:
  • Query
  • Duration
  • IP
  • Clicks
  • Date
  • ...

[Figure: a sample query log]
24
Search Engine Query Logs
  • Trend Detection
  • Social Analysis
  • Control Trends
  • Term Analysis
  • CTD Decrement

25
Sponsored Search Results
26
Term Distribution
[Plot: term frequency in query logs vs. queries, a long-tailed distribution.]
A good idea is to find the inexpensive or mid-priced keywords that are highly similar to the expensive keywords. Methods such as semantic clustering and text mining are applicable.
27
Shortcomings of Search Engines
  • Need to locally store data (documents)
  • Distributed data
  • Visible and invisible (hidden) Web
  • Hidden Web: information hidden from conventional engines
  • No arbitrary crawl of the data (e.g., the ACM library)
  • Updated too frequently to be crawled (e.g., buy.com)
  • Larger than the visible Web: 2-50 times (Sherman, 2001) up to 500 times (Bergman, 2001)
  • Created by professionals

28
Shortcomings Cont.
Different types of hidden information sources:
  • Some allow access to their contents via a source-specific search interface
  • Some allow their contents to be copied by conventional search engines
  • For some, access to the contents is subject to a fee or subscription

29
Shortcomings Cont.
Some examples:
  • US Patent and Trademark Office (USPTO) database [1]
  • National Science Foundation's Award database [2]
  • U.S. Government Printing Office (GPO) portal [3]
  • National Institutes of Health's GenBank [4]

[1] http://www.uspto.gov/patft/index.html
[2] http://www.nsf.gov/awardsearch/
[3] http://www.gpoaccess.gov/databases.html
[4] http://www.ncbi.nih.gov/Genbank/
30
Solution: Distributed Information Retrieval or Federated Search
  • The alternative to a single-database model is a multi-database model.

[Diagram: a DIR engine brokers queries across Retrieval Engine 1 through Retrieval Engine n.]

31
Federated Search Cont.
Cooperative and Uncooperative
  • Resource Description

32
Outline
  • Part 1. Web
  • Information Retrieval
  • Evaluation Method
  • Web Characteristics
  • Search Engines
  • Search Engine Query Logs
  • Shortcomings and Federated Search
  • Part 2. Open Source Search Engines
  • Lemur and Indri
  • Lemur Analysis

33
Open Source Search Engines
  • Lemur: CIIR (UMass) and LTI (CMU) labs
  • Indri: CIIR and LTI labs
  • Lucene
  • Terrier: University of Glasgow
  • Xapian

34
Open Source Search Engines- Comparison
35
Open Source Search Engines- Comparison
36
Open Source Search Engines- Comparison
37
Open Source Search Engines- Lemur
38
Slides From
  • Lemur Toolkit Tutorial (Paul Ogilvie, Trevor Strohman)
  • INDRI Overview (Don Metzler)
39
Zoology 101
  • Lemurs are primates found only in Madagascar
  • 50 species (17 of which are endangered)
  • Ring-tailed lemur: Lemur catta

40
Zoology 101
  • The indri is the largest type of lemur
  • When first spotted, the natives yelled "Indri! Indri!"
  • Malagasy for "Look! Over there!"

41
Installation
  • Linux, OS X
  • Extract software/lemur-4.3.2.tar.gz
  • ./configure --prefix=/install/path
  • make
  • make install
  • Windows
  • Run software/lemur-4.3.2-install.exe
  • Documentation in windoc/index.html

42
Steps
  • Building an index
  • Running queries
  • Evaluating results

43
Indexing
  • Document Preparation
  • Indexing Parameters
  • Time and Space Requirements

44
Indexing Document Preparation
Document formats: the Lemur Toolkit can inherently handle several different document formats without any modification:
  • TREC Text
  • TREC Web
  • Plain text
  • Microsoft Word (*)
  • Microsoft PowerPoint (*)
  • HTML
  • XML
  • PDF
  • Mbox

(*) Note: Microsoft Word and Microsoft PowerPoint can only be indexed on a Windows-based machine, and Office must be installed.
45
Indexing Document Preparation
  • If your documents are not in a format that the Lemur Toolkit can inherently process:
  • If necessary, extract the text from the document.
  • Wrap the plain text in TREC-style wrappers (a sketch follows this slide):
  <DOC>
  <DOCNO>document_id</DOCNO>
  <TEXT>
  Index this document text.
  </TEXT>
  </DOC>
  • or, for more advanced users, write your own parser to extend the Lemur Toolkit.
46
Indexing - Parameters
  • Basic usage to build an index:
  • IndriBuildIndex <parameter_file>
  • The parameter file includes options for:
  • Where to find your data files
  • Where to place the index
  • How much memory to use
  • Stopwords, stemming, fields
  • Many other parameters

47
Indexing - Parameters
  • Standard parameter file specification: an XML document
  <parameters>
    <option></option>
    <option></option>
    <option></option>
  </parameters>

48
Indexing - Parameters
  • <corpus> - where to find your source files and what type to expect
  • <path> (required): the path to the source files (absolute or relative)
  • <class> (optional): the document type to expect. If omitted, IndriBuildIndex will attempt to guess the file type from the file's extension.
  <parameters>
    <corpus>
      <path>/path/to/source/files</path>
      <class>trectext</class>
    </corpus>
  </parameters>

49
Indexing - Parameters
  • The <index> parameter tells IndriBuildIndex where to create or incrementally add to the index:
  • If the index does not exist, it will create a new one
  • If the index already exists, it will append new documents to it.
  <parameters>
    <index>/path/to/the/index</index>
  </parameters>

50
Indexing - Parameters
  • <memory> - defines a soft limit on the amount of memory the indexer should use before flushing its buffers to disk.
  • Use K for kilobytes, M for megabytes, and G for gigabytes.
  <parameters>
    <memory>256M</memory>
  </parameters>

51
Indexing - Parameters
  • Stopwords can be defined within a <stopper> block, with individual stopwords enclosed in <word> tags.
  <parameters>
    <stopper>
      <word>first_word</word>
      <word>next_word</word>
      <word>final_word</word>
    </stopper>
  </parameters>

When using the Web class file, pay attention to <!...> and <!-- ... --> tags.
52
Indexing - Parameters
  • Term stemming can be applied while indexing as well, via the <stemmer> tag.
  • Specify the stemmer type via the <name> tag within.
  • Stemmers included with the Lemur Toolkit are the Krovetz stemmer and the Porter stemmer.
  • A complete parameter file combining the preceding options follows this slide.
  <parameters>
    <stemmer>
      <name>krovetz</name>
    </stemmer>
  </parameters>
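Putting the preceding slides together, a complete IndriBuildIndex parameter file might look like this (paths and stopwords are placeholders):

  <parameters>
    <corpus>
      <path>/path/to/source/files</path>
      <class>trectext</class>
    </corpus>
    <index>/path/to/the/index</index>
    <memory>256M</memory>
    <stopper>
      <word>first_word</word>
      <word>next_word</word>
    </stopper>
    <stemmer>
      <name>krovetz</name>
    </stemmer>
  </parameters>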

53
Indexing anchor text
  • Run the harvestlinks application on your data before indexing
  • Pass <inlink>path-to-links</inlink> as a parameter to IndriBuildIndex to index anchor text

54
Retrieval
  • Parameters
  • Query Formatting
  • Interpreting Results

55
Retrieval - Parameters
  • Basic usage for retrieval:
  • IndriRunQuery/RetEval <parameter_file>
  • The parameter file includes options for:
  • Where to find the index
  • The query or queries
  • How much memory to use
  • Formatting options
  • Many other parameters

56
Retrieval - Parameters
  • Just as with indexing:
  • A well-formed XML document with options, wrapped by <parameters> tags
  <parameters>
    <options></options>
    <options></options>
    <options></options>
  </parameters>

57
Retrieval - Parameters
  • The <index> parameter tells IndriRunQuery/RetEval where to find the repository.
  <parameters>
    <index>/path/to/the/index</index>
  </parameters>

58
Retrieval - Parameters
  • The <query> parameter specifies a query:
  • plain text or using the Indri query language
  <parameters>
    <query>
      <number>1</number>
      <text>this is the first query</text>
    </query>
    <query>
      <number>2</number>
      <text>another query to run</text>
    </query>
  </parameters>

59
Retrieval - Parameters
  • TREC-style topics cannot be processed directly by IndriRunQuery/RetEval.
  • Format the queries accordingly:
  • Format by hand, or
  • Write a script to extract the fields (a sketch follows this slide)
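A rough sketch of such a script, in C++ for consistency with the API examples later in the deck. It assumes the common single-line topic layout ("<num> Number: 301" and "<title> topic terms"); real topic files may need more careful parsing:

  #include <fstream>
  #include <iostream>
  #include <string>

  int main(int argc, char** argv) {
    if (argc < 2) {
      std::cerr << "usage: topics2indri <topic_file>" << std::endl;
      return 1;
    }
    std::ifstream in(argv[1]);
    std::string line, number;

    std::cout << "<parameters>" << std::endl;
    while (std::getline(in, line)) {
      if (line.compare(0, 5, "<num>") == 0) {
        // "<num> Number: 301" -> keep what follows the colon
        size_t colon = line.find(':');
        if (colon != std::string::npos) number = line.substr(colon + 1);
      } else if (line.compare(0, 7, "<title>") == 0) {
        std::cout << "<query>" << std::endl
                  << "  <number>" << number << "</number>" << std::endl
                  << "  <text>" << line.substr(7) << "</text>" << std::endl
                  << "</query>" << std::endl;
      }
    }
    std::cout << "</parameters>" << std::endl;
    return 0;
  }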

60
Retrieval - Parameters
  • As with indexing, the <memory> parameter can be used to define a soft limit on the amount of memory the retrieval system uses.
  • Use K for kilobytes, M for megabytes, and G for gigabytes.
  <parameters>
    <memory>256M</memory>
  </parameters>

61
Retrieval - Parameters
  • As with indexing, stopwords can be defined within a <stopper> block, with individual stopwords enclosed in <word> tags.
  <parameters>
    <stopper>
      <word>first_word</word>
      <word>next_word</word>
      <word>final_word</word>
    </stopper>
  </parameters>

62
Retrieval - Parameters
  • To specify a maximum number of results to return, use the <count> tag:
  <parameters>
    <count>50</count>
  </parameters>

63
Retrieval - Parameters
  • Result formatting options:
  • IndriRunQuery/RetEval has built-in formatting specifications for TREC and INEX retrieval tasks

64
Retrieval - Parameters
  • TREC formatting directives:
  • <runID>: a string specifying the ID for a query run, used in TREC-scorable output.
  • <trecFormat>: true to produce TREC-scorable output, otherwise false (default).
  <parameters>
    <runID>runName</runID>
    <trecFormat>true</trecFormat>
  </parameters>

65
Retrieval - Interpreting Results
  • The default output from IndriRunQuery is a list of results, one result per line, with 4 columns:
  • <score>: the score of the returned document. An Indri query always returns a negative score (a log-probability).
  • <docID>: the document ID
  • <extent_begin>: the starting token number of the retrieved extent
  • <extent_end>: the ending token number of the retrieved extent

66
Retrieval - Interpreting Results
  • When executing IndriRunQuery with the default formatting options, the output will look something like:
  <score> <docID> <extent_begin> <extent_end>
  -4.83646 AP890101-0001 0 485
  -7.06236 AP890101-0015 0 385

67
Retrieval - Evaluation
  • To use trec_eval:
  • Format IndriRunQuery results with the appropriate trec_eval formatting directives in the parameter file:
  <runID>runName</runID>
  <trecFormat>true</trecFormat>
  • The resulting output will be in standard TREC format, ready for evaluation:
  <queryID> Q0 <docID> <rank> <score> <runID>
  150 Q0 AP890101-0001 1 -4.83646 runName
  150 Q0 AP890101-0015 2 -7.06236 runName
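With that output saved to a file, scoring is a single command (assuming trec_eval is installed and a TREC-format qrels file is at hand):

  trec_eval qrels_file trec_output_file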

68
Indri IndexEnvironment
  • Most important methods:
  • addFile: adds a file of text to the index
  • addString: adds a document (in a text string) to the index
  • addParsedDocument: adds a ParsedDocument structure to the index
  • setIndexedFields: tells the indexer which fields to store in the index
  • A usage sketch follows this slide.
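A minimal indexing sketch, assuming the Indri 4.x C++ API (paths and the field name are placeholders):

  #include <indri/IndexEnvironment.hpp>
  #include <string>
  #include <vector>

  int main() {
    indri::api::IndexEnvironment env;

    // Store the title field in the index in addition to the full text.
    std::vector<std::string> fields;
    fields.push_back("title");
    env.setIndexedFields(fields);

    env.create("/path/to/the/index");  // env.open(...) would append instead
    env.addFile("/path/to/source/files/ap890101", "trectext");
    env.close();
    return 0;
  }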

69
Indri QueryEnvironment
  • The core of the Indri API
  • Includes methods for:
  • Opening indexes and connecting to query servers
  • Running queries
  • Collecting collection statistics
  • Retrieving document text
  • Can be used from C++, Java, PHP, or C#

70
QueryEnvironment: Opening
  • Opening methods:
  • addIndex: opens an index from the local disk
  • Indri treats all open indexes as a single collection

71
QueryEnvironment: Running
  • Running queries:
  • runQuery: runs an Indri query and returns a ranked list of results (a document set can be added to restrict evaluation to a few documents)
  • runAnnotatedQuery: returns a ranked list of results and a list of all document locations where the query matched something
  • A usage sketch follows this slide.
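A minimal retrieval sketch, again assuming the Indri 4.x C++ API:

  #include <indri/QueryEnvironment.hpp>
  #include <iostream>
  #include <vector>

  int main() {
    indri::api::QueryEnvironment env;
    env.addIndex("/path/to/the/index");

    // Top 10 results; Indri scores are negative log-probabilities.
    std::vector<indri::api::ScoredExtentResult> results =
        env.runQuery("#combine(this is the first query)", 10);

    for (size_t i = 0; i < results.size(); i++) {
      std::cout << results[i].score << " " << results[i].document << " "
                << results[i].begin << " " << results[i].end << std::endl;
    }
    env.close();
    return 0;
  }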

72
QueryEnvironment: Retrieving
  • Retrieving document text:
  • documents: returns the full text of a set of documents
  • documentMetadata: returns portions of the documents (e.g., just document titles)
  • documentsFromMetadata: returns documents that contain a certain bit of metadata (e.g., a URL)
  • expressionList: returns an inverted list for a particular Indri query language expression
  • A usage sketch follows this slide.
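Continuing the retrieval sketch above (the same API assumptions apply; "docno" is the conventional metadata key for TREC document IDs):

  // Map internal document IDs from the result list back to DOCNOs,
  // then fetch the full document text.
  std::vector<lemur::api::DOCID_T> ids;
  for (size_t i = 0; i < results.size(); i++)
    ids.push_back(results[i].document);

  std::vector<std::string> docnos = env.documentMetadata(ids, "docno");
  std::vector<indri::api::ParsedDocument*> docs = env.documents(ids);

  for (size_t i = 0; i < docs.size(); i++) {
    std::cout << docnos[i] << ": " << docs[i]->text << std::endl;
    delete docs[i];  // the caller owns the returned ParsedDocument objects
  }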

73
Lemur Retrieval
74
Lemur: Other Tasks
  • Clustering: ClusterDB
  • Distributed IR: DistMergeMethod
  • Language models: UnigramLM, DirichletUnigramLM, etc.

75
Getting Help
  • http://www.lemurproject.org
  • Central website: tutorials, documentation, news
  • http://www.lemurproject.org/phorum
  • Discussion board; developers read and respond to questions
  • http://ciir.cs.umass.edu/~strohman/indri
  • Trevor Strohman's page of Indri tips
  • README file in the code distribution

76
Indri in Action
[Demo links: Indexing, Search 1, Search 2]
77
Questions? Thanks for your attention!
78
Indri Query Language
  • terms
  • field restriction / evaluation
  • numeric operators
  • combining beliefs
  • field / passage retrieval
  • filters
  • document priors
  • http://www.lemurproject.org/lemur/IndriQueryLanguage.html
  • a few representative expressions follow this slide
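Slides 79 through 85 are screenshots; as a plain-text stand-in, a few representative Indri query language expressions (see the URL above for the authoritative syntax):

  #combine(dog cat)                  -- combine beliefs for two terms
  #1(white house)                    -- exact phrase (ordered window)
  #uw5(white house)                  -- both terms within an unordered window of 5
  dog.title                          -- occurrences of "dog" restricted to the title field
  #combine[title](query terms)       -- rank title extents instead of whole documents
  #less(year 2000)                   -- numeric field operator
  #filreq(#1(white house) president) -- require the phrase, rank by "president"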

79
Term Operations
80
Field Restriction/Evaluation
81
Numeric Operators
82
Belief Operations
83
Field/Passage Retrieval
84
More Field/Passage Retrieval
  • .//field for ancestor
  • .\field for parent

85
Filter Operations
86
Document Priors
  • RECENT prior, built using the makeprior application