Title: Cheshire II: Features and Internals and Cheshire III overview
1. Cheshire II Features and Internals and Cheshire III Overview
Ray R. Larson
School of Information Management and Systems
University of California, Berkeley
2. Overview
- Cheshire II feature overview
- Logistic Regression Ranking, Okapi BM-25 and Boolean Operations
- Fusion Operators
- Additions from INEX 03
- Element/Index level re-estimation of LR coefficients
- Adhoc and Heterogeneous Track Methodology
- Evaluation Results: Adhoc
3. Overview of Cheshire II
- It supports SGML and XML with components and component indexes
- It is a client/server application
- Uses the Z39.50 Information Retrieval Protocol; support for SRW, OAI, SOAP and SDLIP is also implemented
- Server supports a Relational Database Gateway
- Supports Boolean searching of all servers
- Supports probabilistic ranked retrieval in the Cheshire search engine as well as Boolean and proximity search
- Search engine supports "nearest neighbor" searches and relevance feedback
- GUI interface on X window displays and Windows NT
- WWW/CGI forms interface for DL, using combined client/server CGI scripting via WebCheshire
- Scriptable clients using Tcl and Python
- Store SGML/XML as files or Datastore database
4. Cheshire II Searching
5. INEX Overview
[Diagram; labels: INEX Search Engine, Map Query, Local Net, Map Results]
6. Boolean Search Capability
- All Boolean operations are supported
  - zfind author x and (title y or subject z) not subject A
- Named sets are supported and stored on the server
- Boolean operations between stored sets are supported
  - zfind SET1 and subject widgets or SET2
- Nested parentheses and truncation are supported
  - zfind xtitle Alice
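The interaction between stored sets and Boolean operators can be illustrated with ordinary set algebra. The following is a minimal Python sketch, not Cheshire code: the document ids, the `subject_index` and `named_sets` dictionaries, and the helper names are all hypothetical, and the left-to-right reading of the example query is an assumption.

```python
# Hypothetical data: document ids per subject term and two stored result sets.
subject_index = {"widgets": {2, 3, 5}, "gadgets": {1, 5}}
named_sets = {"SET1": {1, 2, 3}, "SET2": {7, 8}}


def boolean_and(left, right):
    """Documents present in both operands."""
    return left & right


def boolean_or(left, right):
    """Documents present in either operand."""
    return left | right


def boolean_not(left, right):
    """Documents in left but not in right ("left not right")."""
    return left - right


# "SET1 and subject widgets or SET2", read left to right (an assumption).
result = boolean_or(
    boolean_and(named_sets["SET1"], subject_index["widgets"]),
    named_sets["SET2"],
)
named_sets["SET3"] = result  # keep the new set under a name for later reuse
print(sorted(result))
```

Because each result is itself a set, it can be stored under a name and fed into a later query, which is the behavior the stored-set bullets describe.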
7. Probabilistic Retrieval
- Uses the Logistic Regression ranking method developed at Berkeley (W. Cooper, F. Gey, D. Dabney, A. Chen) with a new algorithm for weight calculation at retrieval time
- The Z39.50 relevance operator (@) is used to indicate a probabilistic search
- Any index can have probabilistic searching performed
  - zfind topic @ cheshire cats, looking glasses, march hares and other such things
  - zfind title @ caucus races
- Boolean and probabilistic elements can be combined
  - zfind topic @ government documents and title guidebooks
8. Probabilistic Retrieval: Logistic Regression
The probability of relevance is based on logistic regression over a sample set of documents, which determines the values of the coefficients. At retrieval time the probability estimate is obtained by applying those coefficients to the six X attribute measures shown on the next slide.
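The slide's formula did not survive extraction. As a hedged reconstruction, the standard logistic regression form consistent with the six attributes X_1..X_6 and the coefficients b_0..b_6 listed later in this deck is shown below; the symbols R, Q and C (relevance, query, component) are notation assumed here rather than taken from the slide.

```latex
\log O(R \mid Q, C) = b_0 + \sum_{i=1}^{6} b_i X_i
\qquad
P(R \mid Q, C) = \frac{e^{\log O(R \mid Q, C)}}{1 + e^{\log O(R \mid Q, C)}}
```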
9. Probabilistic Retrieval: Logistic Regression attributes
- X1: Average Absolute Query Frequency
- X2: Query Length
- X3: Average Absolute Component Frequency
- X4: Document Length
- X5: Average Inverse Component Frequency (averaged over the Inverse Component Frequency of the matching terms)
- X6: Number of terms in common between query and component (logged)
10. Combining Boolean and Probabilistic Search Elements
- Two original approaches:
  - Boolean approach
  - Non-probabilistic "Fusion Search": the set merger approach is a weighted merger of document scores from separate Boolean and probabilistic queries
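A weighted merger of this kind can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Cheshire implementation: the function name `weighted_merge`, the `weight` parameter, and the choice to give every Boolean hit a fixed score of 1.0 are all hypothetical.

```python
def weighted_merge(prob_scores, boolean_hits, weight=0.5, boolean_score=1.0):
    """Merge probabilistic scores with Boolean matches by a weighted sum.

    prob_scores:  dict of doc id -> probabilistic score
    boolean_hits: set of doc ids matched by the Boolean query
    weight:       share given to the probabilistic score (assumed parameter)
    """
    merged = {}
    for doc in set(prob_scores) | set(boolean_hits):
        p = prob_scores.get(doc, 0.0)
        b = boolean_score if doc in boolean_hits else 0.0
        merged[doc] = weight * p + (1.0 - weight) * b
    # Highest merged score first; ties broken by doc id for determinism.
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))


# Invented example scores and hits.
prob = {"d1": 0.8, "d2": 0.4, "d3": 0.1}
hits = {"d2", "d4"}
ranked = weighted_merge(prob, hits)
print([doc for doc, _ in ranked])
```

A document that satisfies the Boolean query and also has a moderate probabilistic score (d2 here) rises above a document with a high probabilistic score alone.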
11. Okapi BM25
\sum_{T \in Q} w^{(1)} \frac{(k_1 + 1)\,tf}{K + tf} \cdot \frac{(k_3 + 1)\,qtf}{k_3 + qtf}
- Where:
  - Q is a query containing terms T
  - K is k1((1 - b) + b·dl/avdl)
  - k1, b and k3 are parameters, usually 1.2, 0.75 and 7-1000
  - tf is the frequency of the term in a specific document
  - qtf is the frequency of the term in a topic from which Q was derived
  - dl and avdl are the document length and the average document length measured in some convenient unit
  - w(1) is the Robertson-Sparck Jones weight
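The definitions above correspond to the usual Okapi BM25 sum over query terms of w(1) times a document-side factor (k1+1)tf/(K+tf) and a query-side factor (k3+1)qtf/(k3+qtf). A minimal Python sketch follows; the function names are hypothetical, the Robertson-Sparck Jones weight is passed in rather than computed, and the defaults simply use the typical parameter values quoted on the slide.

```python
def bm25_term_score(w1, tf, qtf, dl, avdl, k1=1.2, b=0.75, k3=7.0):
    """Contribution of one query term to a document's BM25 score.

    w1 is the Robertson-Sparck Jones weight for the term, supplied by the caller.
    """
    K = k1 * ((1.0 - b) + b * dl / avdl)          # length-normalised k1
    doc_part = (k1 + 1.0) * tf / (K + tf)          # saturating term frequency
    query_part = (k3 + 1.0) * qtf / (k3 + qtf)     # saturating query frequency
    return w1 * doc_part * query_part


def bm25_score(query_tf, doc_tf, weights, dl, avdl, **params):
    """Sum the per-term scores over query terms that occur in the document."""
    return sum(
        bm25_term_score(weights[term], doc_tf[term], qtf, dl, avdl, **params)
        for term, qtf in query_tf.items()
        if term in doc_tf
    )


# Invented example: a document of average length, one matching term.
score = bm25_score({"cheshire": 1, "cats": 1}, {"cheshire": 1}, {"cheshire": 2.0, "cats": 1.5}, 100, 100)
print(score)
```

With dl equal to avdl, K reduces to k1, so a term occurring once in both query and document contributes exactly its w(1) weight.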
12. Merging and Ranking Operators
- Extends the capabilities of merging to include merger operations in queries like Boolean operators
- Fuzzy Logic Operators (not used for INEX)
  - !FUZZY_AND
  - !FUZZY_OR
  - !FUZZY_NOT
- Containment operators: restrict components to or with a particular parent
  - !RESTRICT_FROM
  - !RESTRICT_TO
- Merge Operators
  - !MERGE_SUM
  - !MERGE_MEAN
  - !MERGE_NORM
  - !MERGE_CMBZ
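The slide names the merge operators without defining them. As a rough illustration only, the sketch below shows what sum-style and mean-style merging of two scored result sets could look like in Python. The function names, the treatment of a missing document as score 0, and the example scores are assumptions; the exact semantics of !MERGE_SUM, !MERGE_MEAN, !MERGE_NORM and !MERGE_CMBZ are not given in this deck and are not modelled here.

```python
def merge_sum(*result_sets):
    """Add up each document's scores across all result sets."""
    merged = {}
    for results in result_sets:
        for doc, score in results.items():
            merged[doc] = merged.get(doc, 0.0) + score
    return merged


def merge_mean(*result_sets):
    """Average over the number of result sets; a missing document counts as 0."""
    totals = merge_sum(*result_sets)
    count = len(result_sets)
    return {doc: total / count for doc, total in totals.items()}


def rank(merged):
    """Order documents by descending merged score, then by id."""
    return sorted(merged, key=lambda doc: (-merged[doc], doc))


# Invented scores from two hypothetical subqueries.
title_hits = {"d1": 0.6, "d2": 0.2}
topic_hits = {"d2": 0.5, "d3": 0.4}
summed = merge_sum(title_hits, topic_hits)
averaged = merge_mean(title_hits, topic_hits)
print(rank(summed))
```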
13. INEX 04 Fusion Search
[Diagram; labels: Subquery (x4), Comp. Query Results (x2), Fusion/Merge, Final Ranked List]
- Merge multiple ranked and Boolean index searches within each query and multiple component search resultsets
- Major components merged are Articles, Body, Sections, subsections, paragraphs
14. New LR Coefficients
Index           b0      b1      b2      b3      b4      b5      b6
Base        -3.700   1.269  -0.310   0.679  -0.021   0.223   4.010
topic       -7.758   5.670  -3.427   1.787  -0.030   1.952   5.880
topicshort  -6.364   2.739  -1.443   1.228  -0.020   1.280   3.837
abstract    -5.892   2.318  -1.364   0.860  -0.013   1.052   3.600
alltitles   -5.243   2.319  -1.361   1.415  -0.037   1.180   3.696
sec words   -6.392   2.125  -1.648   1.106  -0.075   1.174   3.632
para words  -8.632   1.258  -1.654   1.485  -0.084   1.143   4.004
Estimates using INEX 03 relevance assessments for:
- b1: Average Absolute Query Frequency
- b2: Query Length
- b3: Average Absolute Component Frequency
- b4: Document Length
- b5: Average Inverse Component Frequency
- b6: Number of terms in common between query and component
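To make the role of the per-index coefficients concrete, the sketch below applies a row of the table to six attribute values using the standard logistic form (log odds = b0 + sum of b_i times X_i). It is an illustration, not Cheshire code: the `COEFFICIENTS` dictionary copies three rows from the table, while the function name and the example attribute values are invented, and how each X_i is computed from the index is outside the scope of the sketch.

```python
import math

# (b0, b1, ..., b6) copied from the table; other rows can be added the same way.
COEFFICIENTS = {
    "Base": (-3.700, 1.269, -0.310, 0.679, -0.021, 0.223, 4.010),
    "topic": (-7.758, 5.670, -3.427, 1.787, -0.030, 1.952, 5.880),
    "abstract": (-5.892, 2.318, -1.364, 0.860, -0.013, 1.052, 3.600),
}


def relevance_probability(index_name, x):
    """Estimate P(relevant) from six attribute values x = (X1, ..., X6)."""
    if len(x) != 6:
        raise ValueError("expected six attribute values")
    b = COEFFICIENTS[index_name]
    log_odds = b[0] + sum(bi * xi for bi, xi in zip(b[1:], x))
    return 1.0 / (1.0 + math.exp(-log_odds))


# Invented attribute values, for illustration only.
example_x = (0.5, 1.7, 0.9, 20.0, 1.2, 0.7)
print(relevance_probability("Base", example_x))
print(relevance_probability("topic", example_x))
```

The same query/component pair receives a different estimate depending on which index's coefficients are used, which is the point of re-estimating them per element or index.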
15. SGML/XML Support
- Underlying native format for all data is SGML or XML
- The DTD defines the file format for each file
- Full SGML/XML parsing
- SGML/XML Format Configuration Files define the database
- USMARC DTD and MARC to SGML conversion (and back again)
- Access to full-text via special SGML/XML tags
16. Indexing
- Any SGML/XML tagged field or attribute can be indexed
- B-Tree and Hash access via Berkeley DB (Sleepycat)
- Stemming, keyword, exact keys and special keys
- Mapping from any Z39.50 Attribute combination to a specific index
- Underlying postings information includes term frequency for probabilistic searching
- Component extraction with separate component indexes
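The mapping from Z39.50 attribute combinations to indexes can be pictured as a lookup table. The Python sketch below is a simplification under stated assumptions: it keys only on the USE and STRUCTURE attributes, the pairs are copied from INDXMAP entries in the configuration examples later in this deck, and `INDEX_MAP` and `resolve_index` are hypothetical names rather than part of Cheshire.

```python
# (use attribute, structure attribute) -> index tag.
# Pairs taken from the INDXMAP entries in the deck's configuration examples.
INDEX_MAP = {
    (12, 1): "docno",
    (12, 2): "docno",
    (12, 6): "docno",
    (4, 6): "title",
    (5, 6): "title",
    (6, 6): "title",
    (1, 6): "pauthor",
    (1004, 6): "pauthor",
    (29, 6): "topic",
}


def resolve_index(use, structure):
    """Return the index tag for an attribute combination, or None if unmapped."""
    return INDEX_MAP.get((use, structure))


print(resolve_index(4, 6))
print(resolve_index(1004, 6))
```

Several attribute combinations can point at the same index, which is why the configuration files repeat INDXMAP blocks inside one INDEXDEF.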
17. XML Element Extraction
- A new search ElementSetName is XML_ELEMENT_
- Any XPath, element name, or regular expression can be included following the final underscore when submitting a present request
- The matching elements are extracted from the records matching the search and delivered in a simple format.
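The idea of naming an element after the final underscore and getting back only the matching elements can be sketched with Python's standard `xml.etree.ElementTree`. This is an illustration, not the server's implementation: it handles plain element names only (no XPath or regular expressions), and `extract_elements`, the record list, and the unaccented sample title are hypothetical.

```python
import xml.etree.ElementTree as ET

ELEMENT_SET_PREFIX = "XML_ELEMENT_"


def extract_elements(records, element_set_name):
    """Return (docid, [serialized matching elements]) for each record.

    Only a plain element name after the final underscore is supported here.
    """
    if not element_set_name.startswith(ELEMENT_SET_PREFIX):
        raise ValueError("not an XML_ELEMENT_ element set name")
    name = element_set_name[len(ELEMENT_SET_PREFIX):]
    results = []
    for docid, xml_text in records:
        root = ET.fromstring(xml_text)
        matches = [ET.tostring(el, encoding="unicode") for el in root.iter(name)]
        results.append((docid, matches))
    return results


# Hypothetical record shaped like the USMARC example in this deck.
records = [
    (1, '<USMARC><VarFlds><VarDFlds><Titles>'
        '<Fld245 AddEnty="No" NFChars="0"><a>Singularites a Cargese</a></Fld245>'
        '</Titles></VarDFlds></VarFlds></USMARC>'),
]
results = extract_elements(records, "XML_ELEMENT_Fld245")
print(results[0][1][0])
```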
18. XML Extraction
zselect sherlock
372 Connection with SHERLOCK (sherlock.berkeley.edu) database 'bibfile' at port 2100 is open as connection 372
zfind topic mathematics
OK Status 1 Hits 26 Received 0 Set Default RecordSyntax UNKNOWN
zset recsyntax XML
zset elementset XML_ELEMENT_Fld245
zdisplay
OK Status 0 Received 10 Position 1 Set Default NextPosition 11 RecordSyntax XML 1.2.840.10003.5.109.10
<RESULT_DATA DOCID="1">
<ITEM XPATH="/USMARC[1]/VarFlds[1]/VarDFlds[1]/Titles[1]/Fld245[1]">
<Fld245 AddEnty="No" NFChars="0"><a>Singularités à Cargèse</a></Fld245>
</ITEM>
<RESULT_DATA> etc.
19. SGML/XML Support
- Configuration files for the Server are SGML/XML
  - They include elements describing all of the data files and indexes for the database.
  - They also include instructions on how data is to be extracted for indexing and how Z39.50 attributes map to the indexes for a given database.
20. SGML/XML Support
- Example XML record for a DL document
<ELIB-BIB>
<BIB-VERSION>ELIB-v1.0</BIB-VERSION>
<ID>756</ID>
<ENTRY>June 12, 1996</ENTRY>
<DATE>June 1996</DATE>
<TITLE>Cumulative Watershed Effects Applicability of Available Methodologies to the Sierra Nevada</TITLE>
<ORGANIZATION>University of California</ORGANIZATION>
<TYPE>report</TYPE>
<AUTHOR-INSTITUTIONAL>USDA Forest Service</AUTHOR-INSTITUTIONAL>
<AUTHOR-PERSONAL>Neil H. Berg</AUTHOR-PERSONAL>
<AUTHOR-PERSONAL>Ken B. Roby</AUTHOR-PERSONAL>
<AUTHOR-PERSONAL>Bruce J. McGurk</AUTHOR-PERSONAL>
<PROJECT>SNEP</PROJECT>
<SERIES>Vol 3</SERIES>
<PAGES>40</PAGES>
<TEXT-REF>/elib/data/docs/0700/756/HYPEROCR/hyperocr.html</TEXT-REF>
<PAGED-REF>/elib/data/docs/0700/756/OCR-ASCII-NOZONE</PAGED-REF>
</ELIB-BIB>
21. SGML Support
<USMARC Material="BK" ID="00000003">
<leader><LRL>00722</LRL><RecStat>n</RecStat><RecType>a</RecType><BibLevel>m</BibLevel><UCP></UCP><IndCount>2</IndCount><SFCount>2</SFCount><BaseAddr>00229</BaseAddr><EncLevel> </EncLevel><DscCatFm></DscCatFm><LinkRec></LinkRec><EntryMap><FLength>4</Flength><SCharPos>5</SCharPos><IDLength>0</IDLength><EMUCP></EMUCP></EntryMap></Leader>
<Directry>001001400000005001700014008004100031010001400072035002000086035001700106100001900123245010500142250001100247260003200258300003300290504005000323650003600373700002200409700002200431950003200453998000700485</Directry>
<VarFlds>
<VarCFlds><Fld001>CUBGGLAD1282B</Fld001><Fld005>19940414143202.0</Fld005><Fld008>830810 1983 nyu eng u</Fld008></VarCFlds>
<VarDFlds>
<NumbCode><Fld010 I1="Blank" I2="Blnk"><a>82019962 </a></Fld010><Fld035 I1="Blank" I2="Blnk"><a>(CU)ocm08866667</a></Fld035><Fld035 I1="Blank" I2="Blnk"><a>(CU)GLAD1282</a></Fld035></NumbCode>
<MainEnty><Fld100 NameType="Single" I2=""><a>Burch, John G.</a></Fld100></MainEnty>
<Titles><Fld245 AddEnty="Yes" NFChars="0"><a>Information systems </a><b>theory and practice /</b><c>John G. Burch, Jr., Felix R. Strater, Gary Grudnitski</c></Fld245></Titles>
<EdImprnt><Fld250 I1="Blank" I2="Blnk"><a>3rd ed</a></Fld250><Fld260 I1="" I2="Blnk"><a>New York </a><b>J. Wiley,</b><c>1983</c></Fld260></EdImprnt>
<PhysDesc><Fld300 I1="Blank" I2="Blnk"><a>xvi, 632 p. </a><b>ill. </b><c>24 cm</c></Fld300></PhysDesc>
<Series></Series>
<Notes><Fld504 I1="Blank" I2="Blnk"><a>Includes bibliographical references and index</a></Fld504></Notes>
<SubjAccs><Fld650 SubjLvl="NoInfo" SubjSys="LCSH"><a>Management information systems.</a></Fld650> ...
22. SGML/XML Support
<DOC>
<DOCNO>FT931-3566</DOCNO>
<PROFILE>_AN-DCPCCAA3FT</PROFILE>
<DATE>930316 </DATE>
<HEADLINE> FT 16 MAR 93 / Italy's Corruption Scandal Magistrates hold key to unlocking Tangentopoli - They will set the investigation agenda </HEADLINE>
<BYLINE> By ROBERT GRAHAM </BYLINE>
<TEXT> OVER the weekend the Italian media felt obliged to comment on a non-event. No new arrests had taken place in any of the country's ever more numerous corruption scandals which centre on the illicit funding of political parties ... </TEXT>
23.
<XX> Companies- </XX>
<CO>Ente Nazionale Idrocarburi. Ente Nazionale per L'Energia Electtrica. Ente Partecipazioni E Finanziamento Industria Manifatturiera. IRI Istituto per La Ricostruzione Industriale. </CO>
<XX> Countries- </XX>
<CN>ITZ Italy, EC. </CN>
<XX> Industries- </XX>
<IN>P9222 Legal Counsel and Prosecution. P91 Executive, Legislative and General Government. P13 Oil and Gas Extraction. P9631 Regulation, Administration of Utilities. P6719 Holding Companies, NEC. </IN>
<XX> Types- </XX>
24.
<TP>CMMT Comment &amp; Analysis. GOVT Legal issues. </TP>
<PUB>The Financial Times </PUB>
<PAGE> London Page 4 </PAGE>
</DOC>
25. SGML/XML Support
<article>
<fno>C1050</fno>
<doi>10.1041/C1050s-2000</doi>
<fm>
<hdr><hdr1><ti>COMPUTING IN SCIENCE &amp; ENGINEERING</ti> <crt><issn>1521-9615</issn> /00/10.00 <cci><onm>&copy; 2000 IEEE</onm></cci></crt></hdr1>
<hdr2><obi><volno>Vol. 2</volno><issno>No. 1</issno></obi> <pdt><mo>JANUARY/FEBRUARY</mo><yr>2000</yr></pdt> <pp>pp. 50-59</pp></hdr2> </hdr>
<tig><atl>The Decompositional Approach to Matrix Computation</atl> <pn>pp. 50-59</pn></tig>
<au sequence="first"><fnm>G.W.</fnm><snm>Stewart</snm> <aff><onm>University of Maryland</onm></aff></au>
<fig><art file="c1050x1.gif" w="425" h="321" tw="150" th="113"/></fig>
<abs><p>The introduction of matrix decomposition into numerical linear algebra revolutionized matrix computations. This article outlines the decompositional approach, comments on its history, and surveys the six most widely used decompositions.</p> </abs>
</fm>
<bdy>
<sec><st></st>
<ip1>In 1951, Paul S. Dwyer published <it>Linear Computations</it>, perhaps the first book devoted entirely to numerical linear algebra.<ref rid="bibc10501" type="bib">1</ref> Digital computing was in its infancy, and Dwyer focused on computation with mechanical calculators. Nonetheless, the book was state of the art. <ref rid="c10501" type="fig">Figure 1</ref> reproduces a page of the book dealing with Gaussian elimination. In 1954, Alston S. Householder published <it>Principles of Numerical Analysis</it>,<ref rid="bibc10502" type="bib">2</ref> one of the first modern treatments of high-speed digital computation. <ref rid="c10502" type="fig">Figure 2</ref> reproduces a page from this book, also dealing with Gaussian elimination.</ip1>
<fig id="c10501"><art file="c10501.gif" w="600" h="970" tw="150" th="243"/><no>1</no><fgc>This page from <it>Linear Computations</it> shows that Paul Dwyer's approach begins with a system of scalar equations. Courtesy of John Wiley &amp; Sons.</fgc></fig>
<fig id="c10502"><art file="c10502.gif" w="500" h="807" tw="150" th="242"/><no>2</no><fgc>On this page from <it>Principles of Numerical Analysis</it>, Alston Householder uses partitioned matrices and LU decomposition. Courtesy of McGraw-Hill.</fgc></fig>
<p>The contrast between these two excerpts is striking. The most obvious difference is that Dwyer used scalar equations whereas Householder used partitioned matrices.
26. SGML/XML Support
<sec><st>CONCLUSION</st>
<ip1>The big six are not the only decompositions in use in fact, there are many more. As mentioned earlier, certain intermediate forms&mdash;such as tridiagonal and Hessenberg forms&mdash;have come to be regarded as decompositions in their own right. Since the singular value decomposition is expensive to compute and not readily updated, rank-revealing alternatives have received considerable attention.<ref rid="bibc105054" type="bib">54</ref><super>,</super><ref rid="bibc105055" type="bib">55</ref> There are also generalizations of the singular value decomposition and the Schur decomposition for pairs of matrices. <ref rid="bibc105056" type="bib">56</ref><super>,</super><ref rid="bibc105057" type="bib">57</ref> All crystal balls become cloudy when they look to the future, but it seems safe to say that as long as new matrix problems arise, new decompositions will be devised to solve them.</ip1> </sec> </bdy>
<bm>
<ack><h>Acknowledgment</h> <ip1><it>This work was supported by the National Science Foundation under Grant No. 970909-8562.</it></ip1> </ack>
<bib><bibl><h>References</h>
<bb id="bibc10501"><au><fnm>P.S.</fnm><snm>Dwyer</snm></au><ti>Linear Computations,</ti> <obi>John Wiley &amp; Sons,</obi><loc><cty>New York,</cty></loc><pdt><yr>1951.</yr></pdt></bb>
<bb id="bibc10502"><au><fnm>A.S.</fnm><snm>Householder</snm></au><ti>Principles of Numerical Analysis,</ti> <obi>McGraw-Hill,</obi><loc><cty>New York,</cty></loc><pdt><yr>1953.</yr></pdt></bb>
<bb id="bibc10503"><au><fnm>J.H.</fnm><snm>Wilkinson</snm></au><obi>and</obi> <au><fnm>C.</fnm><snm>Reinsch</snm></au><ti>Handbook for Automatic Computation, Vol. II, Linear Algebra,</ti> <obi>Springer-Verlag,</obi><loc><cty>New York,</cty></loc><pdt><yr>1971.</yr></pdt></bb>
<bb id="bibc10504"><au><fnm>B.S.</fnm><snm>Garbow</snm></au> <obi>et al.,</obi><atl>"Matrix Eigensystem Routines&mdash;Eispack Guide Extension,"</atl> <ti>Lecture Notes in Computer Science,</ti><obi>Springer-Verlag,</obi><loc><cty>New York,</cty></loc><pdt> <yr>1977.</yr></pdt></bb>
<bb id="bibc10505"><au><fnm>J.J.</fnm><snm>Dongarra</snm></au><obi>et al.,</obi> <ti>LINPACK User's Guide,</ti> <obi>SIAM,</obi><loc><cty>Philadelphia,</cty></loc><pdt><yr>1979.</yr></pdt></bb>
27. SGML/XML Support
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE inex_topic SYSTEM "topic.dtd">
<inex_topic topic_id="70" query_type="CAS" ct_no="49">
<title> /article[about(./fm/abs,'"information retrieval" "digital libraries"')]</title>
<description>Retrieve articles with an abstract indicating the article is about information retrieval and/or digital libraries</description>
<narrative>To be relevant the retrieved articles must be about information retrieval, digital libraries or, preferably both. Articles about information retrieval from digital libraries will receive the highest relevance judgements.</narrative>
<keywords>information retrieval,digital libraries</keywords>
</inex_topic>
28. SGML/XML Support
- Configuration files for the Server are also SGML/XML
  - They include tags describing all of the data files and indexes for the database.
  - They also include instructions on how data is to be extracted for indexing and how Z39.50 attributes map to the indexes for a given database.
29. Cheshire Configuration Files
<!-- TREC INTERACTIVE TEST DB -->
<!-- This is the config file for the Cheshire II TREC interactive Database -->
<DBCONFIG>
<DBENV>/projects/is240/GroupX/indexes </DBENV>
<!-- TREC TEST DATABASE FILEDEF -->
<!-- The Interactive TREC Financial Times datafile -->
<FILEDEF TYPE=SGML>
<DEFAULTPATH>/projects/is240/GroupX </DEFAULTPATH>
<!-- filetag is the "shorthand" name of the file -->
<FILETAG> trec </FILETAG>
<!-- filename is the full path name of the main data directory -->
<FILENAME> /projects/is240/ft </FILENAME>
<CONTINCLUDE> /projects/is240/ft.CONT </CONTINCLUDE>
<!-- fileDTD is the full path name of the file's DTD -->
<FILEDTD> /projects/is240/TREC.FT.DTD </FILEDTD>
<!-- assocfil is the full path name of the file's Associator -->
<ASSOCFIL> ft.assoc </ASSOCFIL>
<!-- history is the full path name of the file's history file -->
<HISTORY> cheshire_index/TESTDATA.history </HISTORY>
30. Indexing
- Any SGML/XML tagged field or attribute can be indexed
- B-Tree and Hash access via Berkeley DB (Sleepycat)
- Stemming, keyword, exact keys and special keys
- Mapping from any Z39.50 Attribute combination to a specific index
- Underlying postings information includes term frequency for probabilistic searching
- SGML may include address of full-text for indexing
- New indexes can be easily added, or old ones deleted
31. Bitmapped Indexes
- Bitmap indexes can be used for Boolean operations where the data has only a few values and very large numbers of items with each value
- Only one bit per record stored in the index
- Processed on a demand basis so only blocks with the bits needed to resolve a query are fetched
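The one-bit-per-record idea can be illustrated with Python integers used as bit vectors. This is a minimal in-memory sketch, assuming record ids 0..n-1; the `BitmapIndex` class and the example field values are hypothetical, and the demand-driven fetching of index blocks described above is not modelled.

```python
class BitmapIndex:
    """One bitmap per distinct value; bit i is set if record i has that value."""

    def __init__(self, num_records):
        self.num_records = num_records
        self.bitmaps = {}

    def add(self, record_id, value):
        if not 0 <= record_id < self.num_records:
            raise IndexError("record id out of range")
        self.bitmaps[value] = self.bitmaps.get(value, 0) | (1 << record_id)

    def bitmap(self, value):
        return self.bitmaps.get(value, 0)

    def negate(self, bitmap):
        """Boolean NOT, limited to the records that exist."""
        return ((1 << self.num_records) - 1) & ~bitmap

    def records(self, bitmap):
        """Expand a bitmap back into a list of record ids."""
        return [i for i in range(self.num_records) if (bitmap >> i) & 1]


# Invented data: six records with a type and a language each.
types = BitmapIndex(6)
langs = BitmapIndex(6)
for rid, (rtype, lang) in enumerate([
    ("report", "eng"), ("article", "eng"), ("report", "fre"),
    ("report", "eng"), ("article", "eng"), ("report", "fre"),
]):
    types.add(rid, rtype)
    langs.add(rid, lang)

report_and_eng = types.bitmap("report") & langs.bitmap("eng")  # Boolean AND
print(types.records(report_and_eng))
```

Boolean AND, OR and NOT become single bitwise operations over whole bitmaps, which is why the approach suits fields with few distinct values and many records per value.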
32.
<!-- The following are the index definitions for the file -->
<INDEXES>
<!-- DOC NO. -->
<!-- The following provides document number access. -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE PRIMARYKEY=IGNORE>
<INDXNAME> cheshire_index/trec.docno.index </INDXNAME>
<INDXTAG> docno </INDXTAG>
<INDXMAP> <USE> 12 </USE><struct> 1 </struct> </INDXMAP>
<INDXMAP> <USE> 12 </USE><struct> 2 </struct> </INDXMAP>
<INDXMAP> <USE> 12 </USE><struct> 6 </struct> </INDXMAP>
<INDXKEY><TAGSPEC><FTAG>DOCNO </FTAG></TAGSPEC> </INDXKEY>
</INDEXDEF>
33.
<!-- TOPIC -->
<!-- The following is the primary index for probabilistic searches -->
<!-- It includes headlines, datelines, bylines, and full text -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=STEM>
<INDXNAME> cheshire_index/trec.topic.index </INDXNAME>
<INDXTAG> topic </INDXTAG>
<INDXMAP> <USE> 29 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<INDXMAP> <USE> 29 </USE><RELAT> 102 </RELAT><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<STOPLIST> cheshire_index/topicstoplist </STOPLIST>
<INDXKEY> <TAGSPEC> <FTAG>HEADLINE </FTAG> <FTAG>DATELINE </FTAG> <FTAG>BYLINE </FTAG> <FTAG>TEXT </FTAG> </TAGSPEC> </INDXKEY>
</INDEXDEF>
34. Cheshire II EVI Generation
- Entry Vocabulary Indexes can improve access to data with controlled index terms
- Define basis for clustering records.
- Select field to form the basis of the cluster.
- Evidence Fields to use as contents of the pseudo-documents.
- During indexing cluster keys are generated with basis and evidence from each record.
- Cluster keys are sorted and merged on basis and pseudo-documents created for each unique basis element containing all evidence fields.
- Pseudo-Documents (Class clusters) are indexed on combined evidence fields.
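The steps above can be sketched as a small grouping pipeline in Python. It is an illustration under stated assumptions rather than Cheshire's indexing code: records are modelled as dictionaries of field name to list of strings, the function names are hypothetical, and the class numbers and titles in the example are invented.

```python
from collections import defaultdict


def build_pseudo_documents(records, basis_field, evidence_fields):
    """Generate cluster keys, sort them on basis, and merge per unique basis."""
    cluster_keys = []
    for record in records:
        evidence = [v for f in evidence_fields for v in record.get(f, [])]
        for basis in record.get(basis_field, []):
            cluster_keys.append((basis, evidence))
    cluster_keys.sort(key=lambda key: key[0])  # stable sort keeps record order
    pseudo_documents = defaultdict(list)
    for basis, evidence in cluster_keys:
        pseudo_documents[basis].extend(evidence)
    return dict(pseudo_documents)


def index_pseudo_documents(pseudo_documents):
    """Index each pseudo-document on the words of its combined evidence."""
    index = defaultdict(set)
    for basis, evidence in pseudo_documents.items():
        for text in evidence:
            for word in text.lower().split():
                index[word].add(basis)
    return dict(index)


# Invented records: a class number as basis, titles and subjects as evidence.
records = [
    {"class": ["T58.6"], "title": ["Information systems"],
     "subject": ["Management information systems"]},
    {"class": ["T58.6"], "title": ["Database design"], "subject": []},
    {"class": ["QA188"], "title": ["Matrix computation"], "subject": ["Matrices"]},
]
pseudo = build_pseudo_documents(records, "class", ["title", "subject"])
index = index_pseudo_documents(pseudo)
print(sorted(index["information"]))
```

Searching the resulting index with free-text words returns basis values (here, class numbers), which is how an entry vocabulary maps a user's vocabulary onto controlled terms.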
35. EVI/Cluster Definitions
<!-- CLUSTER DEFINITIONS -->
<CLUSTER>
<clusname> classcluster </clusname>
<cluskey normal=CLASSCLUS>
<tagspec> <FTAG>FLD950 </FTAG> <s> a </s> </tagspec>
</cluskey>
<stoplist> /usr3/cheshire2/data2/clasclusstoplist </stoplist>
<clusmap>
<from> <tagspec>
<ftag>FLD245</ftag><s>ab</s>
<ftag>FLD440</ftag><s>a</s>
<ftag>FLD490</ftag><s>a</s>
<ftag>FLD830</ftag><s>a</s>
<ftag>FLD740</ftag><s>a</s>
</tagspec></from>
<to> <tagspec> <ftag>titles</ftag> </tagspec></to>
<from> <tagspec> <ftag>FLD6..</ftag><s>abcdxyz</s> </tagspec></from>
<to> <tagspec> <ftag>subjects</ftag> </tagspec></to>
<summarize> <maxnum> 5 </maxnum> <tagspec> <ftag>subjsum</ftag> </tagspec></summarize>
</clusmap>
</CLUSTER>
36. Component Extraction and Indexing
- Any element (or range of SGML/XML data starting with one element and ending with another) can be defined as a component and accessed and indexed as if it were an entire document.
- Component indexes and document-level indexes can be combined in search operations (and special operators permit selection of document or components as the result).
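Treating an element as a document in its own right can be sketched with Python's standard `xml.etree.ElementTree`. The example is an illustration, not Cheshire's extraction code: it handles only the single-element case (not a start-tag/end-tag range), and the function name, the component id format, and the sample article are hypothetical.

```python
import xml.etree.ElementTree as ET


def extract_components(docid, xml_text, component_tag):
    """Return one entry per matching element, each linked to its parent document."""
    root = ET.fromstring(xml_text)
    components = []
    for n, element in enumerate(root.iter(component_tag), start=1):
        pieces = (t.strip() for t in element.itertext())
        text = " ".join(p for p in pieces if p)
        components.append({
            "component_id": f"{docid}:{component_tag}[{n}]",  # invented id format
            "parent": docid,
            "text": text,
        })
    return components


# Hypothetical article with two sections.
article = (
    "<article><fno>C1050</fno><bdy>"
    "<sec><st>Intro</st><p>First section.</p></sec>"
    "<sec><st>CONCLUSION</st><p>Last section.</p></sec>"
    "</bdy></article>"
)
components = extract_components("C1050", article, "sec")
print(components[1]["component_id"], components[1]["text"])
```

Each component keeps a pointer to its parent, so a search over a component index can return either the component itself or the document that contains it.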
37. Component Definitions
<COMPONENTS>
<COMPONENTDEF>
<COMPONENTNAME> TESTDATA/COMPONENT_DB1 </COMPONENTNAME>
<COMPONENTNORM>NONE</COMPONENTNORM>
<COMPSTARTTAG> <TAGSPEC> <FTAG>mainenty </FTAG> <FTAG>titles </FTAG> </TAGSPEC> </COMPSTARTTAG>
<COMPENDTAG> <TAGSPEC><FTAG>Fld300 </FTAG></TAGSPEC> </COMPENDTAG>
<COMPONENTINDEXES>
<!-- First index def -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE>
<INDXNAME> TESTDATA/comp1index1.author
</INDEXDEF>
</COMPONENTDEF>
</COMPONENTS>
38. Result Formatting (Display)
<DISPOPTIONS> KEEP_ENTITIES </DISPOPTIONS>
<DISPLAY>
<FORMAT NAME="B" OID="1.2.840.10003.5.105" DEFAULT>
<convert function="TAGSET-G">
<clusmap>
<from> <tagspec> <ftag>DOCNO</ftag> </tagspec></from>
<to> <tagspec> <ftag>28</ftag> </tagspec></to>
<from> <tagspec> <ftag>DOCID</ftag> </tagspec></from>
<to> <tagspec> <ftag>5</ftag> </tagspec></to>
</clusmap>
</convert>
</FORMAT>
</DISPLAY>
39. INEX Configuration Example
<!-- Config for INEX evaluation -->
<!-- This is the config file for the Cheshire II TREC interactive Database -->
<!-- new version uses proximity indexes... -->
<DBCONFIG>
<DBENV>/projects/metadata/cheshire/TREC/cheshire_index </DBENV>
<!-- INEX TEST DATABASE FILEDEF -->
<FILEDEF TYPE=XML>
<DEFAULTPATH> /projects/metadata/cheshire/INEX </DEFAULTPATH>
<!-- filetag is the "shorthand" name of the file -->
<FILETAG> INEX </FILETAG>
<!-- filename is the full path name of the main data directory -->
<FILENAME> inex-1.3/xml </FILENAME>
<CONTINCLUDE> inex-1.3/xml_main.cont </CONTINCLUDE>
<!-- fileDTD is the full path name of the file's DTD -->
<FILEDTD> inex-1.3/dtd/wrapper.dtd </FILEDTD>
<SGMLCAT> inex-1.3/dtd/catalog </SGMLCAT>
<!-- assocfil is the full path name of the file's Associator -->
<ASSOCFIL> inex-1.3/xml_main.assoc </ASSOCFIL>
<!-- history is the full path name of the file's history file -->
<HISTORY> inex.history </HISTORY>
40. INEX Configuration Example
<!-- The following are the index definitions for the file -->
<INDEXES>
<!-- DOC NO. -->
<!-- The following provides document number access. -->
<INDEXDEF ACCESS=BTREE EXTRACT=EXACTKEY NORMAL=DO_NOT_NORMALIZE PRIMARYKEY=IGNORE>
<INDXNAME> indexes/docno.index </INDXNAME>
<INDXTAG> docno </INDXTAG>
<INDXMAP> <USE> 12 </USE><struct> 1 </struct> </INDXMAP>
<INDXMAP> <USE> 12 </USE><struct> 2 </struct> </INDXMAP>
<INDXMAP> <USE> 12 </USE><struct> 6 </struct> </INDXMAP>
<INDXKEY> <TAGSPEC> <FTAG> doi </FTAG> </TAGSPEC> </INDXKEY>
</INDEXDEF>
41. INEX Configuration Example
<!-- PERSONAL AUTHOR/BYLINE -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE>
<INDXNAME> indexes/pauthor.index </INDXNAME>
<INDXTAG> pauthor </INDXTAG>
<!-- The following INDXMAP items provide a mapping from the AUTHOR tag to -->
<!-- the appropriate Z39.50 BIB1 attribute numbers -->
<INDXMAP> <USE> 1 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<INDXMAP> <USE> 1004 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<!-- The stoplist for this file -->
<STOPLIST> indexes/authorstoplist </STOPLIST>
<!-- The INDXKEY area contains the specifications of tags in the doc -->
<!-- that are to be extracted and indexed for this index -->
<INDXKEY> <TAGSPEC>
<FTAG>fm</FTAG><S>au</S><S>snm</S>
<FTAG>fm</FTAG><S>au</S><S>fnm</S>
</TAGSPEC> </INDXKEY>
</INDEXDEF>
42. INEX Configuration Example
<!-- TITLE/HEADLINE -->
<!-- The following provides keyword title access -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROX NORMAL=STEM>
<INDXNAME> indexes/title.index </INDXNAME>
<INDXTAG> title </INDXTAG>
<INDXMAP> <USE> 4 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<INDXMAP> <USE> 5 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<INDXMAP> <USE> 6 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<STOPLIST> indexes/titlestoplist </STOPLIST>
<INDXKEY> <TAGSPEC> <FTAG>fm</FTAG><S>tig</S><S>atl</S> </TAGSPEC> </INDXKEY>
</INDEXDEF>
43. INEX Configuration Example
<!-- TOPIC -->
<!-- The following is the primary index for probabilistic searches -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROX NORMAL=STEM>
<INDXNAME> indexes/topic.index </INDXNAME>
<INDXTAG> topic </INDXTAG>
<INDXMAP> <USE> 29 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<INDXMAP> <USE> 1017 </USE><RELAT> 102 </RELAT><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<STOPLIST> indexes/topicstoplist </STOPLIST>
<INDXKEY> <TAGSPEC>
<FTAG>fm</FTAG><S>tig</S><S>atl</S>
<FTAG>abs</FTAG>
<FTAG>bdy</FTAG>
<FTAG>bibl</FTAG><S>bb</S><S>atl</S>
<FTAG>app</FTAG>
</TAGSPEC> </INDXKEY>
</INDEXDEF>
44. INEX Configuration Example
<!-- DATE -->
<INDEXDEF ACCESS=BTREE EXTRACT=DATE NORMAL=YEAR>
<INDXNAME> indexes/date.index </INDXNAME>
<INDXTAG> date </INDXTAG>
<!-- The following INDXMAP items provide a mapping from the AUTHOR tag to -->
<!-- the appropriate Z39.50 BIB1 attribute numbers -->
<INDXMAP> <USE> 30 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<INDXMAP> <USE> 30 </USE><POSIT> 3 </posit> <struct> 5 </struct> </INDXMAP>
<INDXKEY> <TAGSPEC> <FTAG>hdr2</FTAG><s>yr</s> </TAGSPEC> </INDXKEY>
</INDEXDEF>
45. INEX Configuration Example
<!-- JOURNAL -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE>
<INDXNAME> indexes/journal.index </INDXNAME>
<INDXTAG> journal </INDXTAG>
<INDXMAP> <USE> 1022 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<INDXMAP> <USE> 1022 </USE><POSIT> 3 </posit> <struct> 5 </struct> </INDXMAP>
<INDXKEY> <TAGSPEC> <FTAG>hdr1</FTAG><s>ti</s> </TAGSPEC> </INDXKEY>
</INDEXDEF>
46. INEX Configuration Example
<!-- KEYWORDS -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=STEM>
<INDXNAME> indexes/keywords.index </INDXNAME>
<INDXTAG> kwd </INDXTAG>
<INDXMAP> <USE> 3121 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<STOPLIST> indexes/topicstoplist </STOPLIST>
<INDXKEY> <TAGSPEC> <FTAG>kwd</FTAG> </TAGSPEC> </INDXKEY>
</INDEXDEF>
47. INEX Configuration Example
<!-- ABSTRACT -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=STEM>
<INDXNAME> indexes/abstract.index </INDXNAME>
<INDXTAG> abstract </INDXTAG>
<INDXMAP> <USE> 62 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<STOPLIST> indexes/topicstoplist </STOPLIST>
<INDXKEY> <TAGSPEC> <FTAG>abs</FTAG> </TAGSPEC> </INDXKEY>
</INDEXDEF>
48. INEX Configuration Example
<!-- The following index has contents of the SEQUENCE attribute of the -->
<!-- au (author) tag either "first" or "additional" -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE>
<INDXNAME> indexes/author_seq.index </INDXNAME>
<INDXTAG> author_seq </INDXTAG>
<INDXKEY> <TAGSPEC> <FTAG>fm</FTAG><S>au</S><ATTR>sequence</ATTR> </TAGSPEC> </INDXKEY>
</INDEXDEF>
49. INEX Configuration Example
<!-- Bib author Forename -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE>
<INDXNAME> indexes/bib_author_fnm.index </INDXNAME>
<INDXTAG> bib_author_fnm </INDXTAG>
<INDXMAP> <USE> 1000 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<INDXKEY> <TAGSPEC> <FTAG>bb</FTAG><s>au</s><s>fnm</s> </TAGSPEC> </INDXKEY>
</INDEXDEF>
50. INEX Configuration Example
<!-- Bib author surname -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE>
<INDXNAME> indexes/bib_author_snm.index </INDXNAME>
<INDXTAG> bib_author_snm </INDXTAG>
<INDXMAP> <USE> 1000 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<INDXKEY> <TAGSPEC> <FTAG>bb</FTAG><s>au</s><s>snm</s> </TAGSPEC> </INDXKEY>
</INDEXDEF>
51. INEX Configuration Example
<!-- FIGURES -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=STEM>
<INDXNAME> indexes/fig.index </INDXNAME>
<INDXTAG> fig </INDXTAG>
<INDXMAP> <USE> 3150 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<STOPLIST> indexes/topicstoplist </STOPLIST>
<INDXKEY> <TAGSPEC> <FTAG>fig</FTAG> </TAGSPEC> </INDXKEY>
</INDEXDEF>
52. INEX Configuration Example
<!-- acknowledgements -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=STEM>
<INDXNAME> indexes/ack.index </INDXNAME>
<INDXTAG> ack </INDXTAG>
<INDXMAP> <USE> 3188 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<STOPLIST> indexes/topicstoplist </STOPLIST>
<INDXKEY> <TAGSPEC> <FTAG>ack</FTAG> </TAGSPEC> </INDXKEY>
</INDEXDEF>
53. INEX Configuration Example
<!-- alltitles -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=STEM>
<INDXNAME> indexes/alltitles.index </INDXNAME>
<INDXTAG> alltitles </INDXTAG>
<INDXMAP> <USE> 3188 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<STOPLIST> indexes/titlestoplist </STOPLIST>
<INDXKEY> <TAGSPEC> <FTAG>atl</FTAG> <FTAG>st</FTAG> </TAGSPEC> </INDXKEY>
</INDEXDEF>
54. INEX Configuration Example
<!-- Affiliation -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE>
<INDXNAME> indexes/affil.index </INDXNAME>
<INDXTAG> affil </INDXTAG>
<INDXMAP> <USE> 3189 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
<STOPLIST> indexes/titlestoplist </STOPLIST>
<INDXKEY> <TAGSPEC> <FTAG>fm</FTAG><s>aff</s> </TAGSPEC> </INDXKEY>
</INDEXDEF>
55INEX Configuration Example
<!-- FNO -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=none>
  <INDXNAME> indexes/fno.index </INDXNAME>
  <INDXTAG> fno </INDXTAG>
  <INDXMAP> <USE> 3192 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
  <INDXKEY>
    <TAGSPEC> <FTAG>fno</FTAG> </TAGSPEC>
  </INDXKEY>
</INDEXDEF>
56INEX Configuration Example
<!-- FIGNO -->
<INDEXDEF ACCESS=BTREE EXTRACT=INTEGER NORMAL=NONE>
  <INDXNAME> indexes/figno.index </INDXNAME>
  <INDXTAG> figno </INDXTAG>
  <INDXMAP> <USE> 3193 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
  <INDXKEY>
    <TAGSPEC> <FTAG>fig</FTAG><s>no</s> </TAGSPEC>
  </INDXKEY>
</INDEXDEF>
57INEX Configuration Example
<!-- topicshort -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=STEM>
  <INDXNAME> indexes/topicshort.index </INDXNAME>
  <INDXTAG> topicshort </INDXTAG>
  <INDXMAP> <USE> 3192 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
  <INDXKEY>
    <TAGSPEC> <FTAG>fm</FTAG><S>tig</S><S>atl</S> <FTAG>abs</FTAG> <FTAG>kwd</FTAG> <FTAG>st</FTAG> </TAGSPEC>
  </INDXKEY>
</INDEXDEF>
</INDEXES>
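Each index definition above pairs a TAGSPEC with extraction and normalization settings. As a rough illustration only (this is not Cheshire II code), the Python sketch below mimics what the topicshort definition appears to request: gather the text of fm/tig/atl, abs, kwd and st elements and turn it into lowercased word keys. It assumes that each S names a sub-element nested inside the preceding FTAG. The sample record, the optional stopword set and the function names are hypothetical, and stemming (NORMAL=STEM) is left out.

```python
import re
import xml.etree.ElementTree as ET

# Assumed reading of the topicshort TAGSPEC: each entry is an FTAG followed
# by its S sub-tags, treated here as a chain of nested element names.
TOPICSHORT_PATHS = [["fm", "tig", "atl"], ["abs"], ["kwd"], ["st"]]


def select(root, path):
    """Follow `path` downward, matching each name at any depth within the previous match."""
    current = [root]
    for name in path:
        seen, matched = set(), []
        for element in current:
            for hit in element.iter(name):
                if id(hit) not in seen:  # avoid duplicates from nested matches
                    seen.add(id(hit))
                    matched.append(hit)
        current = matched
    return current


def extract_keys(root, paths, stopwords=frozenset()):
    """Return lowercased word keys from the selected elements, minus stopwords."""
    keys = []
    for path in paths:
        for element in select(root, path):
            text = "".join(element.itertext()).lower()
            keys.extend(w for w in re.findall(r"[a-z0-9]+", text) if w not in stopwords)
    return keys


# Hypothetical record using the element names from the configuration.
SAMPLE = (
    "<article><fm><tig><atl>Ranking XML Elements</atl></tig>"
    "<abs><p>We rank the elements.</p></abs></fm>"
    "<bdy><sec><st>Related Work</st><p>Body text is not indexed here.</p></sec></bdy>"
    "</article>"
)

print(extract_keys(ET.fromstring(SAMPLE), TOPICSHORT_PATHS, {"the", "we"}))
```

Keys come out grouped by TAGSPEC entry, so title words precede abstract words regardless of document order; a real indexer would also record positions for proximity search.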
58INEX Configuration Example
<COMPONENTS>
<COMPONENTDEF>
<COMPONENTNAME> indexes/COMPONENT_SECTION </COMPONENTNAME>
<COMPONENTNORM>NONE</COMPONENTNORM>
<COMPSTARTTAG> <TAGSPEC> <FTAG>sec</FTAG> </TAGSPEC> </COMPSTARTTAG>
<COMPONENTINDEXES>
<!-- First index def -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=NONE>
  <INDXNAME> indexes/sec_title2.index </INDXNAME>
  <INDXTAG> sec_title </INDXTAG>
  <!-- the appropriate Z39.50 BIB1 attribute numbers -->
  <INDXMAP> <USE> 38 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
  <!-- The stoplist for this file -->
  <STOPLIST> indexes/titlestoplist </STOPLIST>
  <!-- The INDXKEY area contains the specifications of tags in the doc -->
  <!-- that are to be extracted and indexed for this index -->
  <INDXKEY>
    <TAGSPEC> <FTAG>sec</FTAG><s>st</s> </TAGSPEC>
  </INDXKEY>
</INDEXDEF>
59INEX Configuration Example
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=STEM>
  <INDXNAME> indexes/sec_words.index </INDXNAME>
  <INDXTAG> sec_words </INDXTAG>
  <!-- the appropriate Z39.50 BIB1 attribute numbers -->
  <INDXMAP> <USE> 39 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
  <!-- The stoplist for this file -->
  <STOPLIST> indexes/topicstoplist </STOPLIST>
  <!-- The INDXKEY area contains the specifications of tags in the doc -->
  <!-- that are to be extracted and indexed for this index -->
  <INDXKEY>
    <TAGSPEC> <FTAG>sec</FTAG> </TAGSPEC>
  </INDXKEY>
</INDEXDEF>
</COMPONENTINDEXES>
</COMPONENTDEF>
60INEX Configuration Example
<COMPONENTDEF>
<COMPONENTNAME> indexes/COMPONENT_BIB </COMPONENTNAME>
<COMPONENTNORM>NONE</COMPONENTNORM>
<COMPSTARTTAG> <TAGSPEC> <FTAG>bm</FTAG><S>bib</S><s>bibl</s><s>bb</s> </TAGSPEC> </COMPSTARTTAG>
<!-- no end tag -->
<COMPONENTINDEXES>
<!-- First index def -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE>
  <INDXNAME> indexes/bib_author.index </INDXNAME>
  <INDXTAG> bib_author </INDXTAG>
  <!-- The following INDXMAP items provide a mapping from the AUTHOR tag to -->
  <!-- the appropriate Z39.50 BIB1 attribute numbers -->
  <INDXMAP> <USE> 1000 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
  <!-- The INDXKEY area contains the specifications of tags in the doc -->
  <!-- that are to be extracted and indexed for this index -->
  <INDXKEY>
    <TAGSPEC> <FTAG>au</FTAG> </TAGSPEC>
  </INDXKEY>
</INDEXDEF>
61INEX Configuration Example
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=NONE>
  <INDXNAME> indexes/bib_title.index </INDXNAME>
  <INDXTAG> bib_title </INDXTAG>
  <!-- The following INDXMAP items provide a mapping from the AUTHOR tag to -->
  <!-- the appropriate Z39.50 BIB1 attribute numbers -->
  <INDXMAP> <USE> 33 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
  <!-- The INDXKEY area contains the specifications of tags in the doc -->
  <!-- that are to be extracted and indexed for this index -->
  <INDXKEY>
    <TAGSPEC> <FTAG>atl</FTAG> </TAGSPEC>
  </INDXKEY>
</INDEXDEF>
62INEX Configuration Example
<INDEXDEF ACCESS=BTREE EXTRACT=DATE NORMAL=YEAR>
  <INDXNAME> indexes/bib_date.index </INDXNAME>
  <INDXTAG> bib_date </INDXTAG>
  <!-- The following INDXMAP items provide a mapping from the AUTHOR tag to -->
  <!-- the appropriate Z39.50 BIB1 attribute numbers -->
  <INDXMAP> <USE> 31 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
  <!-- The INDXKEY area contains the specifications of tags in the doc -->
  <!-- that are to be extracted and indexed for this index -->
  <INDXKEY>
    <TAGSPEC> <FTAG>pdt</FTAG><s>yr</s> </TAGSPEC>
  </INDXKEY>
</INDEXDEF>
</COMPONENTINDEXES>
</COMPONENTDEF>
63INEX Configuration Example
<COMPONENTDEF>
<COMPONENTNAME> indexes/COMPONENT_PARAS </COMPONENTNAME>
<COMPONENTNORM>NONE</COMPONENTNORM>
<COMPSTARTTAG> <TAGSPEC> <FTAG>ilrj|ip1|ip2|ip3|ip4|ip5|item-none|p|p1|p2|p3|tmath|tf</FTAG> </TAGSPEC> </COMPSTARTTAG>
<COMPONENTINDEXES>
<!-- First index def -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=STEM>
  <INDXNAME> indexes/para_words.index </INDXNAME>
  <INDXTAG> para_words </INDXTAG>
  <!-- the appropriate Z39.50 BIB1 attribute numbers -->
  <INDXMAP> <USE> 39 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
  <!-- The stoplist for this file -->
  <STOPLIST> indexes/topicstoplist </STOPLIST>
  <!-- The INDXKEY area contains the specifications of tags in the doc -->
  <!-- that are to be extracted and indexed for this index -->
  <INDXKEY>
    <TAGSPEC> <FTAG>.</FTAG> </TAGSPEC>
  </INDXKEY>
</INDEXDEF>
</COMPONENTINDEXES>
</COMPONENTDEF>
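A COMPONENTDEF such as the paragraph component above treats every element whose tag matches the COMPSTARTTAG specification as a separately indexed and retrievable unit. The Python sketch below illustrates that idea; it is not the Cheshire II implementation. It assumes the FTAG value is an alternation of tag names that must match a whole element name, and the sample document and function name are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Assumed reading of the COMPONENT_PARAS start-tag specification:
# an alternation of paragraph-like element names.
PARA_TAGS = re.compile(r"ilrj|ip1|ip2|ip3|ip4|ip5|item-none|p|p1|p2|p3|tmath|tf")


def split_components(root, tag_pattern):
    """Yield (tag, text) for every element whose whole tag name matches."""
    for element in root.iter():
        if tag_pattern.fullmatch(element.tag):
            # Collapse whitespace so each component is a single line of text.
            text = " ".join("".join(element.itertext()).split())
            yield element.tag, text


# Hypothetical section with two paragraph-like elements and a figure.
SAMPLE = (
    "<sec><st>Intro</st><ip1>First paragraph.</ip1>"
    "<p>Second <it>styled</it> paragraph.</p>"
    "<fig><fgc>A caption</fgc></fig></sec>"
)

components = list(split_components(ET.fromstring(SAMPLE), PARA_TAGS))
for tag, text in components:
    print(tag, "->", text)
```

Using a full match rather than a prefix match keeps `p` from also capturing unrelated tags that merely start with the same letter; each yielded unit would then be fed to the component's own indexes (para_words here).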
64INEX Configuration Example
<COMPONENTDEF>
<COMPONENTNAME> indexes/COMPONENT_FIG </COMPONENTNAME>
<COMPONENTNORM>NONE</COMPONENTNORM>
<COMPSTARTTAG> <TAGSPEC> <FTAG>fig</FTAG> </TAGSPEC> </COMPSTARTTAG>
<COMPONENTINDEXES>
<!-- First index def -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD NORMAL=NONE>
  <INDXNAME> indexes/fig_caption.index </INDXNAME>
  <INDXTAG> fig_caption </INDXTAG>
  <!-- the appropriate Z39.50 BIB1 attribute numbers -->
  <INDXMAP> <USE> 38 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
  <!-- The stoplist for this file -->
  <STOPLIST> indexes/titlestoplist </STOPLIST>
  <!-- The INDXKEY area contains the specifications of tags in the doc -->
  <!-- that are to be extracted and indexed for this index -->
  <INDXKEY>
    <TAGSPEC> <FTAG>fgc</FTAG> </TAGSPEC>
  </INDXKEY>
</INDEXDEF>
</COMPONENTINDEXES>
</COMPONENTDEF>
65INEX Configuration Example
<COMPONENTDEF>
<COMPONENTNAME> indexes/COMPONENT_VITAE </COMPONENTNAME>
<COMPONENTNORM>NONE</COMPONENTNORM>
<COMPSTARTTAG> <TAGSPEC> <FTAG>vt</FTAG> </TAGSPEC> </COMPSTARTTAG>
<COMPONENTINDEXES>
<!-- First index def -->
<INDEXDEF ACCESS=BTREE EXTRACT=KEYWORD_PROXIMITY NORMAL=NONE>
  <INDXNAME> indexes/vitae_words.index </INDXNAME>
  <INDXTAG> vt_vitae </INDXTAG>
  <!-- the appropriate Z39.50 BIB1 attribute numbers -->
  <INDXMAP> <USE> 38 </USE><POSIT> 3 </posit> <struct> 6 </struct> </INDXMAP>
  <!-- The stoplist for this file -->
  <STOPLIST> indexes/titlestoplist </STOPLIST>
  <!-- The INDXKEY area contains the specifications of tags in the doc -->
  <!-- that are to be extracted and indexed for this index -->
  <INDXKEY>
    <TAGSPEC> <FTAG>vt</FTAG> </TAGSPEC>
  </INDXKEY>
</INDEXDEF>
</COMPONENTINDEXES>
</COMPONENTDEF>
</COMPONENTS>
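The INDXMAP entries in the component definitions above tie each index to Z39.50 Bib-1 attribute numbers; in Bib-1, position 3 means "any position in field" and structure 6 means "word list". The Python sketch below shows one way a server could resolve an incoming attribute combination to an index tag. The table values are copied from the configuration shown, but the lookup keyed by component name, the shortened component names and the function name are assumptions made for illustration, not a description of how Cheshire II resolves attributes.

```python
# (component, Bib-1 use attribute) -> INDXTAG, copied from the INDXMAP
# entries of the component index definitions. Keying by component is an
# assumption of this sketch: the same use value recurs across components.
COMPONENT_INDEX_MAP = {
    ("COMPONENT_SECTION", 38): "sec_title",
    ("COMPONENT_SECTION", 39): "sec_words",
    ("COMPONENT_BIB", 1000): "bib_author",
    ("COMPONENT_BIB", 33): "bib_title",
    ("COMPONENT_BIB", 31): "bib_date",
    ("COMPONENT_PARAS", 39): "para_words",
    ("COMPONENT_FIG", 38): "fig_caption",
    ("COMPONENT_VITAE", 38): "vt_vitae",
}

POSITION_ANY = 3         # Bib-1 position attribute: any position in field
STRUCTURE_WORD_LIST = 6  # Bib-1 structure attribute: word list


def resolve_index(component, use, position=POSITION_ANY, structure=STRUCTURE_WORD_LIST):
    """Return the index tag for an attribute combination, or None if unmapped."""
    # Every INDXMAP shown uses position 3 and structure 6, so anything
    # else has no mapping in this sketch.
    if (position, structure) != (POSITION_ANY, STRUCTURE_WORD_LIST):
        return None
    return COMPONENT_INDEX_MAP.get((component, use))


print(resolve_index("COMPONENT_BIB", 31))      # bib_date
print(resolve_index("COMPONENT_SECTION", 39))  # sec_words
```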
66INEX Configuration Example
<DISPOPTIONS> KEEP_ENTITIES </DISPOPTIONS>
<DISPLAY>
<DISPLAYDEF NAME="B" OID="1.2.840.10003.5.105" DEFAULT>
<convert function="MIXED">
<clusmap>
  <from> <tagspec> <ftag>doi</ftag> </tagspec> </from>
  <to> <tagspec> <ftag>28</ftag> </tagspec> </to>
  <from> <tagspec> <ftag>DOCID</ftag> </tagspec> </from>
  <to> <tagspec> <ftag>5</ftag> </tagspec> </to>
  <from> <tagspec> <ftag>DBNAME</ftag> </tagspec> </from>
67INEX Configuration Example
<DISPLAYDEF name="XML_ELEMENT_" OID="1.2.840.10003.5.109.10">
<convert function="XML_ELEMENT">
<clusmap>
  <from> <tagspec> <ftag>FILENAME</ftag> </tagspec> </from>
  <to> <tagspec> <ftag>FILENAME</ftag> </tagspec> </to>
  <from> <tagspec> <ftag>RANK</ftag> </tagspec> </from>
  <to> <tagspec> <ftag>RANK</ftag> </tagspec> </to>
  <from> <tagspec> <ftag>RAWSCORE</ftag> </tagspec> </from>
  <to> <tagspec> <ftag>RAWSCORE</ftag> </tagspec> </to>
  <from> <tagspec> <ftag>SUBST_ELEMENT</ftag> </tagspec> </from>
  <to> <tagspec> <ftag>SUBST_ELEMENT</ftag> </tagspec> </to>
</clusmap>
</convert>
</DISPLAYDEF>
</DISPLAY>
</FILEDEF>
</DBCONFIG>
71XML Schemas and Element Retrieval
72XML Schema Support
- XML Schemas or DTDs can be used to define the data contents
- Tested with a wide variety of schemas, including METS (with various supporting schemas)
73XML Element Extraction
- A new search ElementSetName is XML_ELEMENT_
- Any XPath, element name, or regular expression can be included following the final underscore when submitting a present request (note: only a subset of full XPath is available)
- The matching elements are extracted from the records matching the search and delivered in a simple format
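To make the mechanism concrete, here is a hedged Python sketch of what a server might do with such an ElementSetName: strip the XML_ELEMENT_ prefix and evaluate the remainder against each record that matched the search. It is an illustration, not Cheshire II code. It uses the limited XPath subset of Python's xml.etree.ElementTree as a stand-in for Cheshire's own subset, ignores the regular-expression form, and the sample record, output shape and function name are assumptions.

```python
import xml.etree.ElementTree as ET

PREFIX = "XML_ELEMENT_"


def extract_elements(record_xml, element_set_name):
    """Return serialized elements selected by the text after the prefix."""
    if not element_set_name.startswith(PREFIX):
        raise ValueError("not an XML_ELEMENT_ element set name")
    spec = element_set_name[len(PREFIX):]

    if spec.startswith("//"):
        path = "." + spec      # '//sec/st' -> './/sec/st'
    elif "/" not in spec:
        path = ".//" + spec    # bare element name: match at any depth
    else:
        path = spec            # relative path; absolute '/a/b' is not handled here

    root = ET.fromstring(record_xml)
    # strip() drops any trailing text that tostring() includes after the element
    return [ET.tostring(el, encoding="unicode").strip() for el in root.findall(path)]


# Hypothetical record that matched a search.
RECORD = (
    "<article><fm><atl>Alice in Wonderland</atl></fm>"
    "<bdy><sec><st>Down the Rabbit-Hole</st></sec></bdy></article>"
)

print(extract_elements(RECORD, "XML_ELEMENT_atl"))
print(extract_elements(RECORD, "XML_ELEMENT_//sec/st"))
```

Only the selected elements are returned, which is the point of the feature: a client can ask for section titles or captions without transferring and parsing whole documents.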
74XML Extraction
zselect sherlock
372 Connection with SHERLOCK (sherlock.berkeley.edu) database 'bibfile' at port 2100 is open as connection 372
zfind
topic mathematics OK S