Title: Mercure applied to XML retrieval
1Mercure applied to XML retrieval
Karen Sauvagnat
SIG-RFI team, IRIT
118 route de Narbonne, 31062 Toulouse Cedex 4, France
sauvagnat_at_irit.fr
2Plan
- Mercure system
- INEX approach
- General system architecture
- Indexing the INEX corpus and queries
- Retrieval
- Submitted runs
- Results
- Perspectives
3Mercure system
- Mercure is based on a connectionist model
- The system has been constantly redesigned and updated to adapt information retrieval to users' needs
- Frequent participation in the TREC evaluation initiative
4Main features of Mercure system
- It supports any text format (free text, markup text, ...)
- It takes into account term proximity
- The search engine:
- Formatting: extraction of the textual part of documents
- Indexing
- Searching, with two kinds of queries:
- list of terms (free text queries)
- document
- Relevance feedback: blind or driven by the user
- Full-text information retrieval system
5INEX APPROACH
- General system architecture
- Indexing the INEX corpus and queries
- Retrieval
- Submitted runs
- Results
6General System Architecture
7Query indexing
8Query examples
- CO query no. 111
- <title> natural language processing -programming language -modeling language human language </title>
- <keywords> natural language processing, human language technology, computational linguistics, speech processing, parsing natural language, natural language understanding, natural language interface </keywords>
- CAS query no. 76
- <title> //article[(.//yr = 2000 or .//yr = 1999) AND about(., intelligent transportation system)]//sec[about(., automation vehicle)] </title>
9Retrieval
- Searching
- using Mercure
- with or without using term proximity
- Results: ordered list of 1000 documents for each query
10Structured retrieval
- Two modules, which aim to filter the most specific and exhaustive elements of the documents returned by Mercure
- a content-oriented module, dealing with queries composed of simple keyword terms (CO queries)
- a content-and-structure-oriented module, processing queries containing both explicit references to the XML structure and content constraints (CAS queries)
[Diagram: formatting and indexing; content-oriented module and content-and-structure-oriented module; CAS queries; ordered lists of results (documents, then documents/elements)]
11Retrieval with CO queries: the content-oriented module
- Browses through documents returned by Mercure
- Element types that can be retrieved are pre-specified manually, according to the DTD
<article>
<fno></fno> <doi></doi> ...
<abs><p>Software based on nonlinear shell theory can simulate 3D motions related to real fabric-manufacturing processes. ...</p></abs>
</fno>
<bdy>
<sec>
<p>The automotive and aircraft industries (among many others) have streamlined their design and manufacturing processes by adopting CAD/CAM approaches.</p>
...
</sec>
<sec>
<st>Literature review</st>
<p>The state of the art in fabric drape simulation can be traced to the work of Peirce, ...</p>
</sec>
</bdy>
</article>
- Searches for occurrences of query terms in all retrievable elements
- Returns the elements containing the greatest number of query terms
- If more than 2 elements are deemed relevant, the module returns the root element
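The steps of the content-oriented module can be illustrated with a minimal Python sketch. It is not Mercure's code: the set `RETRIEVABLE_TAGS`, the function names, and the choice to count distinct query terms are assumptions made for illustration. Only the overall logic (scan retrievable elements, keep those with the most query terms, fall back to the root when more than 2 qualify) comes from the slide.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical selection: the slides only say that retrievable element
# types are chosen by hand from the DTD.
RETRIEVABLE_TAGS = {"abs", "sec"}


def count_query_terms(element, query_terms):
    """Number of distinct query terms found in the element's text (assumed measure)."""
    text = " ".join(element.itertext()).lower()
    words = set(re.findall(r"\w+", text))
    return sum(1 for term in query_terms if term.lower() in words)


def content_oriented_filter(xml_text, query_terms, max_elements=2):
    """Return the retrievable elements holding the most query terms.

    If more than `max_elements` elements share the best score, the root
    element is returned instead, as described on the slide.
    """
    root = ET.fromstring(xml_text)
    candidates = [e for e in root.iter() if e.tag in RETRIEVABLE_TAGS]
    scored = [(count_query_terms(e, query_terms), e) for e in candidates]
    best = max((score for score, _ in scored), default=0)
    if best == 0:
        return []  # no retrievable element mentions any query term
    winners = [e for score, e in scored if score == best]
    if len(winners) > max_elements:
        return [root]
    return winners


# Small made-up document, loosely modelled on the INEX sample above.
DOC = """<article>
<abs><p>Simulation of fabric manufacturing processes.</p></abs>
<bdy>
<sec><p>Design and manufacturing processes for fabric adopt CAD approaches.</p></sec>
<sec><st>Literature review</st><p>Fabric drape simulation started with Peirce.</p></sec>
</bdy>
</article>"""

print([e.tag for e in content_oriented_filter(DOC, ["fabric", "drape"])])
print([e.tag for e in content_oriented_filter(DOC, ["fabric"])])
```

With the query `fabric drape` only the second section reaches the best score, whereas the single term `fabric` occurs in all three retrievable elements, so the sketch falls back to the root `article`.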
12Retrieval with CAS queries: use of the content-and-structure-oriented module
- Three parts used in the query
- target element
- content constraint
- year constraint
- Browses documents returned by Mercure
- Searches for occurrences of query terms in all target elements
- Returns the target element containing the greatest number of query terms
- If the target elements do not contain any of the terms of the content constraint, the document is removed from the list of results
- Filters elements according to the article publication date (if there is a year constraint)
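A minimal Python sketch of this filtering, assuming the three query parts (target element, content constraint, year constraint) have already been extracted from the CAS query. The function names and the input format (a ranked list of XML strings) are hypothetical. The `yr` tag is taken from the CAS query example, and reading the first `yr` element as the publication year is an assumption.

```python
import re
import xml.etree.ElementTree as ET


def term_count(element, terms):
    """Number of distinct constraint terms found in the element's text (assumed measure)."""
    words = set(re.findall(r"\w+", " ".join(element.itertext()).lower()))
    return sum(1 for term in terms if term.lower() in words)


def cas_filter(documents, target_tag, content_terms, years=None):
    """Filter a ranked list of XML documents down to one target element each.

    documents     : XML strings, in the order returned by the first-stage search
    target_tag    : tag name of the target element, e.g. "sec"
    content_terms : terms of the content constraint on the target element
    years         : optional set of accepted publication years, as strings
    """
    results = []
    for xml_text in documents:
        root = ET.fromstring(xml_text)
        if years is not None:
            year = root.findtext(".//yr")
            if year is None or year.strip() not in years:
                continue  # year constraint not satisfied
        scored = [(term_count(e, content_terms), e) for e in root.iter(target_tag)]
        if not scored:
            continue  # document has no target element
        best_score, best_element = max(scored, key=lambda pair: pair[0])
        if best_score == 0:
            continue  # no target element matches the content constraint
        results.append(best_element)
    return results
```

The ranking produced by the first-stage search is preserved: documents are only dropped or narrowed down to one element, never reordered.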
13Submitted runs
- Aim of our experiments
- test whether a full-text information retrieval system can be easily adapted to structured retrieval
- measure the effect of term positions in the INEX query types
- 5 runs performed, with or without term proximity
- 2 for the CO task
- 2 for the SCAS task
- 1 for the VCAS task
14Results
15Discussion (1)
- Runs using term positions are definitely better than simple search for both query types
- For the CO task, elements that can be returned by the content-oriented module are preselected manually.
- statistics or aggregation methods may be used to find those elements automatically
16Discussion (2)
- The content-and-structure-oriented module cannot handle all content and structural constraints: it processes only content constraints on the target element and year constraints
- topic 90: //article[about(./sec, (...) electronic commerce (...))]//abs[about(., trust authentication)]
- topic 84: //p[about(., overview distributed query processing join)]
- Query processing is relatively slow, because the modules have to browse all documents returned by Mercure
17Perspectives
- An indexing model taking into account both structural and content information of documents seems to be necessary
- The combination of different IR measures (IDF, IEF, ...) could be tested
18 The fetch and browse model (1)
- Fetch
- Horizontal preselection of documents Di
satisfying
19 The fetch and browse model (2)
- Browse
- Vertical selection of the most specific units within the preselected documents (fetch)
- Recursive case
- Stop case
failed
20Weights calculation formula for retrieval (1)
- Weight of a term i in a document j
- where
- h_i: parameters which depend on the document collection
- tf_i: frequency of term t_i in document d_j
- N: total number of documents
- Δl: mean size of documents
- dl_j: size of document j
- n_i: number of documents which contain the term t_i
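The weighting formula itself did not survive in this transcript. As a hedged illustration only, the sketch below combines the quantities listed above in an Okapi-like form, tf · (h1 + h2 · log(N / n_i)) / (h3 + h4 · dl_j / Δl + h5 · tf). This form, the use of five parameters h1 to h5, the natural logarithm, and the function name are assumptions, not something the slides confirm.

```python
import math


def term_document_weight(tf, n_docs, doc_freq, doc_len, mean_doc_len, h):
    """Assumed Okapi-like weight of a term in a document.

    tf           : frequency of the term in the document (tf_i)
    n_docs       : total number of documents (N)
    doc_freq     : number of documents containing the term (n_i)
    doc_len      : size of the document (dl_j)
    mean_doc_len : mean size of documents
    h            : tuple (h1, h2, h3, h4, h5) of collection-dependent
                   parameters; the slides do not give their values
    """
    h1, h2, h3, h4, h5 = h
    idf_part = h1 + h2 * math.log(n_docs / doc_freq)
    length_norm = h3 + h4 * (doc_len / mean_doc_len) + h5 * tf
    return tf * idf_part / length_norm
```

Under this assumed form, a rarer term (smaller n_i) gets a higher weight, and the same term frequency counts for less in a longer-than-average document.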
21Weights calculation formula for retrieval(2)
- Weight of a term i in a query u
- Query evaluation
- where qtf_ui is the frequency of term t_i in query u
22Term proximity
- Modification of the ranking function
- Documents having close query terms compute a new input value
- Where
- a is a constant parameter, set to 4 for the INEX 2003 experiments
- prox_{i,i-1} is the number of terms separating the query terms t_i and t_{i-1} in the window of a terms
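The quantity prox_{i,i-1} can be illustrated with a small Python sketch. How it enters the new input value is not recoverable from this transcript, so the sketch stops at measuring the separation. The function name, the whitespace tokenisation, and the reading of "window of a terms" as "fewer than a separating terms" are assumptions.

```python
def separating_terms(tokens, term_a, term_b, window=4):
    """Smallest number of tokens strictly between term_a and term_b.

    Returns None when the two terms never occur with fewer than
    `window` tokens between them (assumed meaning of the window).
    """
    pos_a = [i for i, tok in enumerate(tokens) if tok == term_a]
    pos_b = [i for i, tok in enumerate(tokens) if tok == term_b]
    gaps = [abs(i - j) - 1 for i in pos_a for j in pos_b if i != j]
    close = [gap for gap in gaps if gap < window]
    return min(close) if close else None


tokens = "the state of the art in fabric drape simulation".split()
print(separating_terms(tokens, "fabric", "drape"))
print(separating_terms(tokens, "state", "simulation"))
```

Adjacent terms such as `fabric` and `drape` have zero separating terms, while `state` and `simulation` lie outside a window of 4 and are treated as not close.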