Mercure applied to XML retrieval

1
Mercure applied to XML retrieval
Karen Sauvagnat
SIG-RFI team, IRIT
118 route de Narbonne
31062 Toulouse Cedex 4
France
sauvagnat_at_irit.fr
2
Plan
  • Mercure system
  • INEX approach
  • General system architecture
  • Indexing the INEX corpus and queries
  • Retrieval
  • Submitted runs
  • Results
  • Perspectives

3
Mercure system
  • Mercure is based on a connectionist model
  • The system has been continually redesigned and
    updated to adapt information retrieval to users'
    needs
  • Frequent participation in the TREC evaluation
    initiative

4
Main features of Mercure system
  • It supports any text format (free text, markup
    text, ...)
  • It takes into account term proximity
  • The search engine
    • Formatting: extraction of the textual part of
      documents
    • Indexing
    • Searching, with two kinds of queries
      • list of terms (free-text queries)
      • document
    • Relevance feedback: blind or driven by the user
  • Full-text information retrieval system

5
INEX APPROACH
  • General system architecture
  • Indexing the INEX corpus and queries
  • Retrieval
  • Submitted runs
  • Results

6
General System Architecture
7
Query indexing
8
Query examples
  • CO query no. 111
    • <title> natural language processing -
      programming language - modeling language
      human language </title>
    • <keywords> natural language processing, human
      language technology, computational linguistics,
      speech processing, parsing natural language,
      natural language understanding, natural language
      interface </keywords>
  • CAS query no. 76
    • <title> //article[(.//yr = 2000 or .//yr = 1999)
      AND about(., intelligent transportation
      system)]//sec[about(., automation vehicle)]
      </title>

9
Retrieval
  • Searching
    • using Mercure
    • with or without using term proximity
  • Results: an ordered list of 1000 documents for each
    query

10
Structured retrieval
  • Two modules, which aim to filter the most
    specific and exhaustive elements of the documents
    returned by Mercure
    • a content-oriented module, dealing with queries
      composed of simple keyword terms (CO queries)
    • a content-and-structure-oriented module,
      processing queries containing both explicit
      references to the XML structure and content
      constraints (CAS queries)

[Figure: structured retrieval pipeline. Labels: Formatting, Indexing, CAS queries, Content-oriented module, Content-and-structure-oriented module, Ordered list of results (documents), Ordered list of results (documents/elements)]
11
Retrieval with CO queries: the content-oriented
module
  • Browses through documents returned by Mercure
  • Element types that can be retrieved are
    pre-specified manually, according to the DTD

<article>
<fno></fno>
<doi></doi>
...
<abs><p>Software based on nonlinear shell theory can simulate 3D
motions related to real fabric-manufacturing processes. ...</p></abs>
</fno>
<bdy>
<sec>
<p>The automotive and aircraft industries (among many others) have
streamlined their design and manufacturing processes by adopting
CAD/CAM approaches.</p>
...
</sec>
<sec>
<st>Literature review</st>
<p>The state of the art in fabric drape simulation can be traced to
the work of Peirce, ...</p>
</sec>
</bdy>
</article>
  • Searches for occurrences of query terms in all
    retrievable elements
  • Returns elements containing the greatest number
    of query terms
  • If more than two elements are deemed relevant,
    the module returns the root element
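
The steps of the content-oriented module can be sketched in Python as follows. This is only an illustration under stated assumptions, not the Mercure implementation: the set `RETRIEVABLE_TAGS`, the function names, and the whole-word matching are hypothetical, and "relevant" is read here as "tied for the greatest number of matched query terms".

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical choice: the slides only say that retrievable element types
# are picked by hand from the DTD.
RETRIEVABLE_TAGS = {"abs", "sec", "p"}


def matched_terms(element, query_terms):
    """Count the distinct query terms occurring in the element's text."""
    words = set(re.findall(r"\w+", "".join(element.itertext()).lower()))
    return len(query_terms & words)


def content_oriented(xml_text, query_terms, max_elements=2):
    """Return the elements to retrieve for one document (possibly none)."""
    root = ET.fromstring(xml_text)
    terms = {t.lower() for t in query_terms}
    scored = [
        (matched_terms(el, terms), el)
        for el in root.iter()
        if el.tag in RETRIEVABLE_TAGS
    ]
    best = max((score for score, _ in scored), default=0)
    if best == 0:
        return []  # no retrievable element contains a query term
    winners = [el for score, el in scored if score == best]
    # Assumption: more than two tied elements means the whole document
    # is the better answer, so the root element is returned instead.
    if len(winners) > max_elements:
        return [root]
    return winners


SAMPLE = (
    "<article><abs><p>fabric drape simulation</p></abs><bdy>"
    "<sec><p>CAD CAM design</p></sec>"
    "<sec><st>Literature review</st><p>fabric drape</p></sec>"
    "</bdy></article>"
)

if __name__ == "__main__":
    hits = content_oriented(SAMPLE, ["fabric", "drape", "simulation"])
    print([el.tag for el in hits])
```

Because an element and its ancestors share text, a section and the paragraph inside it can tie; a real system would need a rule for such nested matches, which the slides do not describe.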

12
Retrieval with CAS queries: use of the
content-and-structure-oriented module
  • Three parts used in the query
    • target element
    • content constraint
    • year constraint
  • Browses documents returned by Mercure
  • Searches for occurrences of query terms in all
    target elements
  • Returns the target element containing the
    greatest number of query terms
  • If the target elements do not contain any of the
    terms of the content constraint, the document is
    removed from the list of results
  • Filters elements according to the article
    publication date (if there is a year constraint)
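
A minimal Python sketch of this filtering step is given below, under loudly stated assumptions: the function name `cas_filter`, the whole-word matching, and the reading of the publication year from a `yr` element (the tag used in CAS query 76) are illustrative choices, not the actual module.

```python
import re
import xml.etree.ElementTree as ET


def cas_filter(xml_text, target_tag, content_terms, years=None):
    """Return the best target element, or None to drop the document."""
    root = ET.fromstring(xml_text)

    # Year constraint (assumption: the publication year sits in a <yr> element).
    if years is not None:
        year = root.findtext(".//yr")
        if year is None or year.strip() not in years:
            return None

    terms = {t.lower() for t in content_terms}
    best, best_score = None, 0
    for element in root.iter(target_tag):
        words = set(re.findall(r"\w+", "".join(element.itertext()).lower()))
        score = len(terms & words)
        # Keep the target element containing the most content terms;
        # on a tie the first one in document order wins.
        if score > best_score:
            best, best_score = element, score

    # None when no target element contains any content term:
    # the document is then removed from the result list.
    return best


SAMPLE = (
    "<article><yr>2000</yr><bdy>"
    "<sec><p>vehicle automation systems</p></sec>"
    "<sec><p>traffic automation</p></sec>"
    "</bdy></article>"
)

if __name__ == "__main__":
    hit = cas_filter(SAMPLE, "sec", ["automation", "vehicle"], {"1999", "2000"})
    print("".join(hit.itertext()) if hit is not None else "document removed")
```

The three arguments after the document mirror the three query parts listed above: target element, content constraint, and year constraint.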

<article>
<fno></fno>
<doi></doi>
...
<abs><p>Software based on nonlinear shell theory can simulate 3D
motions related to real fabric-manufacturing processes. ...</p></abs>
</fno>
<bdy>
<sec>
<p>The automotive and aircraft industries (among many others) have
streamlined their design and manufacturing processes by adopting
CAD/CAM approaches.</p>
...
</sec>
<sec>
<st>Literature review</st>
<p>The state of the art in fabric drape simulation can be traced to
the work of Peirce, ...</p>
</sec>
</bdy>
</article>
13
Submitted runs
  • Aim of our experiments
    • test whether a full-text information retrieval
      system can be easily adapted to structured
      retrieval
    • measure the effect of term positions in the INEX
      query types
  • 5 runs performed, with or without term proximity
    • 2 for the CO task
    • 2 for the SCAS task
    • 1 for the VCAS task

14
Results
15
Discussion (1)
  • Runs using term positions perform better than
    simple search for both query types
  • For the CO task, elements that can be returned by
    the content-oriented module are preselected
    manually
    • Statistics or aggregation methods may be used to
      find those elements automatically

16
Discussion (2)
  • The content-and-structure-oriented module cannot
    handle all content and structural constraints:
    it processes only content constraints on the
    target element and year constraints
    • topic 90: //article[about(./sec, (...)
      electronic commerce (...))]//abs[about(., trust
      authentication)]
    • topic 84: //p[about(., overview distributed
      query processing join)]
  • Query processing is relatively slow, because the
    modules have to browse all documents returned by
    Mercure

17
Perspectives
  • An indexing model taking into account structural
    and content information of documents seems to be
    necessary
  • The combination of different IR measures (IDF,
    IEF, ...) could be tested

18
The fetch and browse model (1)
  • Fetch
  • Horizontal preselection of documents Di
    satisfying

19
The fetch and browse model (2)
  • Browse
  • Vertical selection of most specific units within
    the preselected documents (fetch)
  • Recursive case
  • Stop case

20
Weights calculation formula for retrieval (1)
  • Weight of a term i in a document j
  • where
    • h_i: parameters that depend on the document
      collection
    • tf_i: frequency of term t_i in document d_j
    • N: total number of documents
    • Δl: mean size of documents
    • dl_j: size of document j
    • n_i: number of documents that contain the term
      t_i

21
Weights calculation formula for retrieval (2)
  • Weight of a term i in a query u
  • Query evaluation

where qtf_ui is the frequency of term t_i in query u
22
Term proximity
  • Modification of the ranking function
  • For documents in which query terms occur close
    together, a new input value is computed
  • where
    • a is a constant parameter such that ...; a is
      set to 4 for the INEX 2003 experiments
    • prox_i,i-1 is the number of terms separating the
      query terms t_i and t_i-1 within a window of a
      terms
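
The quantity prox_i,i-1 can be illustrated with a short Python sketch. It only counts the terms separating two query terms; it does not reproduce the modified ranking function, which is not given in this transcript. The function name and the reading of "window" (both terms must fit inside a span of `window` consecutive terms) are assumptions.

```python
def separating_terms(tokens, term_a, term_b, window=4):
    """Smallest number of terms between term_a and term_b.

    Returns None when the two terms never co-occur inside a span of
    `window` consecutive tokens (assumed meaning of "window of a terms").
    """
    positions_a = [i for i, tok in enumerate(tokens) if tok == term_a]
    positions_b = [i for i, tok in enumerate(tokens) if tok == term_b]
    best = None
    for i in positions_a:
        for j in positions_b:
            if i == j:
                continue  # same occurrence (only possible if term_a == term_b)
            span = abs(i - j) + 1  # tokens covered, both terms included
            gap = abs(i - j) - 1   # tokens strictly between the two terms
            if span <= window and (best is None or gap < best):
                best = gap
    return best


TOKENS = "intelligent transportation system for vehicle automation".split()

if __name__ == "__main__":
    print(separating_terms(TOKENS, "intelligent", "transportation"))
    print(separating_terms(TOKENS, "intelligent", "vehicle"))
```

The default `window=4` follows the value of a reported for the INEX 2003 experiments.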