Title: SDLIP + STARTS = SDARTS: A Protocol and Toolkit for Metasearching
1 SDLIP + STARTS = SDARTS: A Protocol and Toolkit for Metasearching
- Noah Green
- Panagiotis G. Ipeirotis
- Luis Gravano
Computer Science Dept., Columbia University
2 Web vs. Hidden Web
- Web
- Link structure
- Crawlable
- Individual collections (or Hidden Web)
- No link structure
- Documents hidden behind search forms
3 Metasearching
- Given many document sources and a query, a metasearcher:
- Finds the good sources for the query.
- Evaluates the query at these sources.
- Merges the results from these sources.
[Diagram labels: Metasearcher, Existing Web Application, Non-indexed Documents, Legacy Database / WAIS / etc.]
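The three metasearching steps listed on this slide can be sketched in Python. This is only an illustration under stated assumptions, not the SDARTS implementation: the `Source` class, the document-frequency-based source selection, and the merge by raw term count are all hypothetical choices made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    """Hypothetical in-memory stand-in for a searchable collection."""
    name: str
    docs: dict = field(default_factory=dict)  # doc id -> text

    def doc_freq(self, term):
        # Number of documents that contain the term (case-insensitive).
        return sum(term.lower() in text.lower().split()
                   for text in self.docs.values())

    def search(self, term):
        # Score each matching document by its raw term count.
        hits = []
        for doc_id, text in self.docs.items():
            count = text.lower().split().count(term.lower())
            if count:
                hits.append((count, self.name, doc_id))
        return hits


def metasearch(sources, term, max_sources=2):
    # Step 1: find the sources that look best for the query.
    ranked = sorted(sources, key=lambda s: s.doc_freq(term), reverse=True)
    chosen = [s for s in ranked[:max_sources] if s.doc_freq(term) > 0]
    # Step 2: evaluate the query at each chosen source.
    hits = [hit for s in chosen for hit in s.search(term)]
    # Step 3: merge the per-source results into one ranked list.
    return sorted(hits, reverse=True)
```

Merging by raw counts is the simplest possible choice; scores from real sources are generally not comparable without collection statistics, which is one reason the metadata discussed later in the deck matters.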
4 Metasearching Issues
- How to evaluate the relevance of different sources?
- How to get metadata?
- How to query different types of sources?
- How to merge the results?
[Diagram: the metasearcher must send the same query to different sources in different native forms:]
http://.../getTitle?title=biomedical
SELECT title FROM articles . . .
grep biomedical *.txt
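To make the heterogeneity on this slide concrete, the sketch below turns one keyword into the three kinds of native request shown above. The function names, the `title` parameter, the `WHERE` clause, and the `*.txt` pattern are assumptions for illustration; the slide elides the real URL and the rest of the SQL statement.

```python
import shlex
from urllib.parse import urlencode


def to_cgi_url(base_url, term):
    # Web source: encode the term as a CGI query parameter.
    return f"{base_url}?{urlencode({'title': term})}"


def to_sql(term):
    # Database source: parameterized statement plus the value to bind.
    # The WHERE clause is a guess; the slide shows only the SELECT prefix.
    return "SELECT title FROM articles WHERE title LIKE ?", (f"%{term}%",)


def to_grep(term, pattern="*.txt"):
    # Plain-text source: shell command with the term safely quoted.
    return f"grep -l {shlex.quote(term)} {pattern}"
```

Without a common protocol, a metasearcher needs one such translator, and one result parser, per source.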
5 Solution: A Common Protocol
6 Why SDARTS = SDLIP + STARTS?
- NOT yet another protocol
- We combined existing efforts, keeping compatibility
- SDLIP defines a common interface for interacting with the sources
- STARTS defines expressive metadata that sources should export
7 SDARTS Outline
- Description of SDLIP.
- Description of STARTS.
- Integration of SDLIP and STARTS into SDARTS.
- Implementation and configuration of SDARTS
wrappers.
8 SDLIP: Simple Digital Library Interoperability Protocol
- Developed during the DLI2 project by:
- Stanford University
- UC Berkeley
- UC San Diego
- UC Santa Barbara
- San Diego Supercomputer Center
- California Digital Library
9 SDLIP: An Interoperability Protocol
Common SDLIP interface
- Basic interfaces
- Search
- Metadata
- A wrapper implements these interfaces
- Interface parameter and return types are XML
- Transport layer implementations (HTTP, CORBA)
- Flexible and adaptable
- Optimized for clients that know the source to query (i.e., simple requirements for metadata)
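The interface idea on this slide (two basic interfaces, implemented by a wrapper, with XML in and XML out) can be sketched as follows. This is not the SDLIP API: the class names, the XML element names, and the toy list-backed wrapper are hypothetical, and `get_property_info` is only loosely named after the `getPropertyInfo()` call mentioned later in the deck.

```python
from abc import ABC, abstractmethod
import xml.etree.ElementTree as ET


class Search(ABC):
    @abstractmethod
    def search(self, query_xml: str) -> str:
        """Take a query as XML and return results as XML."""


class Metadata(ABC):
    @abstractmethod
    def get_property_info(self) -> str:
        """Return a description of the source as XML."""


class ListWrapper(Search, Metadata):
    """Toy wrapper around a Python list of titles (illustration only)."""

    def __init__(self, titles):
        self.titles = titles

    def search(self, query_xml):
        term = ET.fromstring(query_xml).findtext("term", default="")
        results = ET.Element("results")
        for title in self.titles:
            if term.lower() in title.lower():
                ET.SubElement(results, "doc").text = title
        return ET.tostring(results, encoding="unicode")

    def get_property_info(self):
        info = ET.Element("properties", numdocs=str(len(self.titles)))
        return ET.tostring(info, encoding="unicode")
```

Because both parameters and return values are XML strings, the same wrapper could sit behind any transport layer that can carry text.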
10 STARTS: Informal Standard for Search Engine Interoperability
- Coordinated by Stanford in 1996
- Both search engine vendors and "users" participated:
- Netscape
- Microsoft Network
- GILS
- Infoseek
- Harvest
- Hewlett-Packard
- Fulcrum
- Verity
- Wais
- PLS
- Excite
11 STARTS: A Metasearching Protocol
- Defines
- Query language
- Results format
- Metadata for the collection
- No specified transport layer or implementation
- Naturally complements SDLIP for metasearching
purposes
Example of metadata:
- Stemming: no
- Number of docs: 20,000
- Diabetes: TF = 12, DF = 4
- XML: TF = 1200, DF = 750
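As a rough illustration of why such metadata helps, the sketch below uses the numbers from the example above to estimate how well a source covers a term. The dictionary layout and the `doc_fraction` and `rank_sources` helpers are assumptions for this example; STARTS defines what is exported, not how a metasearcher must score sources.

```python
def doc_fraction(summary, term):
    """Fraction of the source's documents that contain `term`."""
    stats = summary["terms"].get(term.lower())
    return stats["df"] / summary["numdocs"] if stats else 0.0


def rank_sources(summaries, term):
    """Order source names by decreasing document fraction for `term`."""
    return sorted(summaries,
                  key=lambda name: doc_fraction(summaries[name], term),
                  reverse=True)


# Values taken from the slide's example metadata.
summary = {
    "numdocs": 20000,
    "stemming": False,
    "terms": {
        "diabetes": {"tf": 12, "df": 4},
        "xml": {"tf": 1200, "df": 750},
    },
}
```

With these numbers, 750 of 20,000 documents mention "XML" but only 4 mention "Diabetes", so this source is a far better candidate for XML queries than for medical ones.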
12 SDARTS = SDLIP + STARTS
- Extends SDLIP with a richer metadata interface from STARTS
- Keeps compatibility with SDLIP (same DTDs)
- Can easily support similar protocols (transforming XML is easy)
- Makes wrapping collections easy through a toolkit
13 SDARTS Implementation Details
- Defined STARTS using XML; the new version is named STARTS XML.
- Used getPropertyInfo() from SDLIP to extend SDLIP with STARTS metadata.
- Term frequency information is available through a different URL (faster download for metasearchers that do not use it).
14 Example of STARTS Metadata: Content Summary
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE starts:scontent-summary SYSTEM "http://www.cs.columbia.edu/dli2test/dtd/starts.dtd">
<starts:scontent-summary xmlns:starts="http://www.cs.columbia.edu/dli2test/STARTS/"
    version="Starts 1.0"
    stemming="false"
    stopwords="false"
    case-sensitive="true"
    fields="false"
    numdocs="19997">
  <starts:field-freq-info>
    <starts:field type-set="basic1" name="body-of-text"/>
    <starts:term>
      <starts:value>algorithm</starts:value>
    </starts:term>
    <starts:term-freq>75</starts:term-freq>
    <starts:doc-freq>34</starts:doc-freq>
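A metasearcher consuming a content summary like the one above might read it as sketched below. The excerpt on the slide is cut off, so the `SAMPLE` string here closes the open elements by hand and omits the DOCTYPE; the namespace URI is copied from the slide with only the colon restored, and the returned dictionary layout is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace URI as it appears on the slide (assumed to be complete).
NS = {"starts": "http://www.cs.columbia.edu/dli2test/STARTS/"}

# Hand-completed version of the truncated slide excerpt.
SAMPLE = """\
<starts:scontent-summary
    xmlns:starts="http://www.cs.columbia.edu/dli2test/STARTS/"
    version="Starts 1.0" stemming="false" numdocs="19997">
  <starts:field-freq-info>
    <starts:field type-set="basic1" name="body-of-text"/>
    <starts:term><starts:value>algorithm</starts:value></starts:term>
    <starts:term-freq>75</starts:term-freq>
    <starts:doc-freq>34</starts:doc-freq>
  </starts:field-freq-info>
</starts:scontent-summary>
"""


def parse_summary(xml_text):
    """Extract collection size and per-term statistics from a summary."""
    root = ET.fromstring(xml_text)
    terms = {}
    for info in root.findall("starts:field-freq-info", NS):
        value = info.findtext("starts:term/starts:value", namespaces=NS)
        terms[value] = {
            "field": info.find("starts:field", NS).get("name"),
            "tf": int(info.findtext("starts:term-freq", namespaces=NS)),
            "df": int(info.findtext("starts:doc-freq", namespaces=NS)),
        }
    return {
        "numdocs": int(root.get("numdocs")),
        "stemming": root.get("stemming") == "true",
        "terms": terms,
    }
```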
15 SDARTS Wrapper Design Rationale
- Goal: Isolate the developer from parsing and generating STARTS XML requests and responses
- Goal: Reusability and simplicity
- SDARTS toolkits and reference implementations
- Wrapping local text document collections
- Wrapping XML collections
- Wrapping HTTP/CGI interfaces
16 SDARTS Wrapping Architecture
[Architecture diagram labels: Client Program, SDARTS Bean, Existing SDLIP Client, STARTS XML over HTTP/DASL, SDLIP LSP, FrontEnd LSP (S, M), STARTS XML, LSPObjects, BackEndLSP, Native Protocol / Search Engine]
17 SDARTS Wrapper Implementation
- Standardize on STARTS as the XML protocol for SDLIP
- Create a standard wrapper architecture
[Diagram labels: FrontEnd LSP (S, M), STARTS XML, LSPObjects, BackEnd LSP, Native Protocol / Search Engine]
- Front-End
- Implements SDLIP interfaces
- Communicates with client using STARTS XML nested inside SDLIP method calls
- Back-End
- Communicates with front-end using simple container objects
- Talks to underlying collection using native protocol
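The front-end/back-end split described on this slide can be sketched as below. All names here (`Query`, `Result`, `DictBackEnd`, `FrontEnd`) and the toy XML vocabulary are hypothetical stand-ins, not the SDARTS classes or STARTS XML; the point is only that XML handling lives in one place and back-ends see plain container objects.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass
class Query:
    """Simple container object handed to the back-end."""
    term: str


@dataclass
class Result:
    """Simple container object returned by the back-end."""
    doc_id: str
    score: float


class DictBackEnd:
    """Hypothetical back-end whose 'native protocol' is a dict lookup."""

    def __init__(self, index):
        self.index = index  # term -> list of (doc_id, score)

    def query(self, q):
        return [Result(d, s) for d, s in self.index.get(q.term, [])]


class FrontEnd:
    """Does all XML parsing and generation on behalf of the back-end."""

    def __init__(self, back_end):
        self.back_end = back_end

    def search(self, query_xml):
        # XML request -> container object.
        q = Query(ET.fromstring(query_xml).findtext("term", default=""))
        # Container objects -> XML response.
        root = ET.Element("results")
        for r in self.back_end.query(q):
            ET.SubElement(root, "doc", id=r.doc_id, score=str(r.score))
        return ET.tostring(root, encoding="unicode")
```

Supporting a new kind of collection then means writing only a new back-end class with a `query` method.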
18 Adding a Local Text Collection
- Write standard doc_config.xml file
- Regular expressions to describe where to find fields
- No coding or compilation needed!
[Diagram labels: doc_config.xml, meta_attributes.xml, content_summary.xml, index, TextBackEndLSP, Lucene Search Engine, Non-indexed Text Documents]
19 Sample doc_config.xml
<doc-config re-index="true">
  <path>/home/dli2test/collections/doc1/20groups</path>
  <linkage-prefix>http://localhost/20groups</linkage-prefix>
  . . .
  <stop-words>
    <word>the</word>
    <word>a</word>
  </stop-words>
  . . .
  <field-descriptor name="author">
    <start><regexp>From </regexp></start>
    <end><regexp></regexp></end>
  </field-descriptor>
  . . .
</doc-config>
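The start and end regular expressions in a field descriptor suggest an extraction step along the following lines. This is a guess at the general technique, not the TextBackEndLSP code: `extract_field` is a hypothetical helper, and the patterns `From: ` and `\n` used in the usage example are assumed values (the slide's own patterns may have lost characters in extraction).

```python
import re


def extract_field(text, start_regexp, end_regexp):
    """Return the text between the first start match and the next end match.

    Returns None when the start pattern does not occur. If the end
    pattern does not occur, the field runs to the end of the text.
    """
    start = re.search(start_regexp, text)
    if start is None:
        return None
    rest = text[start.end():]
    end = re.search(end_regexp, rest)
    stop = end.start() if end else len(rest)
    return rest[:stop].strip()
```

For a newsgroup-style post such as `"From: alice@example.org\nSubject: hello\n\nBody text"`, calling `extract_field(post, r"From: ", r"\n")` yields the author address, which could then be indexed under the `author` field.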
20 Adding a Local XML Collection
- Write standard doc_config.xml file
- Write an XSL stylesheet to find fields in documents
- No coding or compilation needed!
[Diagram labels: doc_config.xml, doc_style.xsl, meta_attributes.xml, content_summary.xml, index, XMLBackEndLSP, Apache Xalan XSL Processor, Lucene Search Engine, Non-indexed XML Documents]
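The slide relies on an XSL stylesheet processed by Xalan to locate fields. Python's standard library has no XSLT processor, so the sketch below swaps in ElementTree path expressions to show the same idea: a declarative mapping from field names to locations in the document. The element names and the `FIELD_PATHS` mapping are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from index field names to element paths;
# in SDARTS this role is played by the XSL stylesheet.
FIELD_PATHS = {
    "title": "./header/title",
    "author": "./header/author",
    "body-of-text": "./body",
}


def extract_fields(xml_text, field_paths=FIELD_PATHS):
    """Return {field name: text} for every mapped element that exists."""
    root = ET.fromstring(xml_text)
    fields = {}
    for name, path in field_paths.items():
        node = root.find(path)
        if node is not None:
            # itertext() also collects text nested inside child elements.
            fields[name] = "".join(node.itertext()).strip()
    return fields
```

As with the stylesheet approach, adapting to a new document schema means editing the mapping rather than the extraction code.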
21 Adding an External Web Collection
- Must code a custom wrapper to send correct CGI parameters and parse returning HTML
- No new code needed: uses XSLT for parsing the results
- Usually no metadata or content summary available
- Possible to automate metadata extraction:
- Callan et al., SIGMOD99: Automatic extraction of vocabulary statistics
- Ipeirotis et al., SIGMOD01: Automatic categorization of databases
- Raghavan and Garcia-Molina, VLDB 2001: Automatic interaction with forms
[Diagram labels: meta_attributes.xml, Web BackEnd LSP, HTTP/CGI Collection]
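The two jobs of a web wrapper named on this slide, sending the right CGI parameters and parsing the returned HTML, might look like the sketch below. SDARTS is described as using XSLT for the parsing step; this example substitutes Python's `html.parser` and makes no network calls. The parameter name `q`, the example URL, and the assumption that each result is an `<a>` link are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


def build_request_url(base_url, term, param="q"):
    # Encode the query term as a CGI parameter of the search form.
    return f"{base_url}?{urlencode({param: term})}"


class ResultLinkParser(HTMLParser):
    """Collects (href, link text) pairs from a result page."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.results.append((self._href, "".join(self._text).strip()))
            self._href = None


def parse_results(html):
    parser = ResultLinkParser()
    parser.feed(html)
    parser.close()
    return parser.results
```

A real result page would need a more selective rule than "every link is a hit", which is where a per-site stylesheet or configuration comes in.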
22 Conclusions
- SDARTS uses SDLIP interfaces and code (compatible with it).
- SDARTS enhances SDLIP and STARTS.
- Reference wrappers available for common collection types.
- Any text or XML document collection can be easily wrapped without new compiled code.
- Automatic metadata extraction for local collections
- Using XSLT for web wrappers
- Possible to automate the extraction of rich metadata for web-accessible collections
- New wrappers can be written without having to parse or generate STARTS XML.
- SDARTS is in Java and can run on multiple platforms.
23 We are on the Web :)
- Available for downloading:
- SDARTS DTDs and documentation
- Java code and search engine (Lucene) included
- Source code documentation
- Web client source code
- Reference wrappers (text, XML, web)
- Wrapped collections
- The web client is publicly accessible for testing and querying our SDARTS server
- http://sdarts.cs.columbia.edu/
24 Related Work
- Metadata
- Open Archives
- Dublin Core
- MARC
- Interoperability Protocols
- Z39.50
- GILS