SDLIP + STARTS = SDARTS: A Protocol and Toolkit for Metasearching



1
SDLIP + STARTS = SDARTS: A Protocol and Toolkit
for Metasearching
  • Noah Green
  • Panagiotis G. Ipeirotis
  • Luis Gravano

Computer Science Dept., Columbia University
2
Web vs. Hidden Web
  • Web
      • Link structure
      • Crawlable
  • Individual collections (or Hidden Web)
      • No link structure
      • Documents hidden behind search forms

3
Metasearching
  • Given many document sources and a query, a
    metasearcher:
      • Finds the good sources for the query.
      • Evaluates the query at these sources.
      • Merges the results from these sources.

[Diagram labels: Metasearcher; Existing Web Application; Non-indexed Documents; Legacy Database / WAIS / etc.]
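
The three metasearching steps listed on this slide (select sources, evaluate the query, merge the results) can be sketched as follows. This is only an illustration under stated assumptions: the `Source` container, the document-frequency scoring rule, and the sample collections are all hypothetical, and merging raw scores from different engines is a deliberate simplification rather than the approach the presentation describes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Source:
    """Hypothetical description of one searchable collection."""
    name: str
    num_docs: int
    doc_freq: Dict[str, int]  # term -> number of documents containing it
    search: Callable[[str], List[Tuple[str, float]]]  # query -> [(doc_id, score)]


def source_score(source: Source, query: str) -> float:
    """Toy estimate: product of the fractions of documents containing each term."""
    score = 1.0
    for term in query.lower().split():
        score *= source.doc_freq.get(term, 0) / source.num_docs
    return score


def metasearch(sources: List[Source], query: str,
               top_sources: int = 2, top_results: int = 5):
    # Step 1: find the good sources for the query.
    ranked = sorted(sources, key=lambda s: source_score(s, query), reverse=True)
    chosen = [s for s in ranked[:top_sources] if source_score(s, query) > 0]
    # Step 2: evaluate the query at these sources.
    merged = []
    for source in chosen:
        for doc_id, score in source.search(query):
            merged.append((score, source.name, doc_id))
    # Step 3: merge the results (naively, by raw score).
    merged.sort(key=lambda item: item[0], reverse=True)
    return [(name, doc_id, score) for score, name, doc_id in merged[:top_results]]


# Made-up collections; each "search" is a canned stand-in for a real engine.
medline = Source("medline", 20000, {"diabetes": 750},
                 lambda q: [("m1", 0.9), ("m2", 0.4)])
cs_papers = Source("cs-papers", 20000, {"diabetes": 4, "xml": 1200},
                   lambda q: [("c1", 0.7)])
news = Source("news", 1000, {}, lambda q: [("n1", 0.99)])

results = metasearch([medline, cs_papers, news], "diabetes")
for name, doc_id, score in results:
    print(name, doc_id, score)
```

In this sketch the `news` collection is never queried, because its summary contains no occurrence of the query term.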
4
Metasearching Issues
  • How to evaluate the relevance of different
    sources?
  • How to get metadata?
  • How to query different types of sources?
  • How to merge the results?

[Diagram: the Metasearcher must issue a different native query to each source]
http://.../getTitle?title=biomedical
SELECT title FROM articles . . .
grep biomedical *.txt
5
Solution: A Common Protocol
6
Why SDARTS = SDLIP + STARTS?
  • NOT yet another protocol
  • We combined existing efforts, keeping
    compatibility
  • SDLIP defines a common interface for interacting
    with the sources
  • STARTS defines expressive metadata that sources
    should export

7
SDARTS Outline
  • Description of SDLIP.
  • Description of STARTS.
  • Integration of SDLIP and STARTS into SDARTS.
  • Implementation and configuration of SDARTS
    wrappers.

8
SDLIP: Simple Digital Library Interoperability Protocol
  • Developed during the DLI2 project by:
      • Stanford University
      • UC Berkeley
      • UC San Diego
      • UC Santa Barbara
      • San Diego Supercomputer Center
      • California Digital Library

9
SDLIP: An Interoperability Protocol
Common SDLIP interface
  • Basic interfaces:
      • Search
      • Metadata
  • A wrapper implements these interfaces
  • Interface parameter and return types are XML
  • Transport layer implementations (HTTP, CORBA)
  • Flexible and adaptable
  • Optimized for clients that know the source to
    query (i.e., simple requirements for metadata)
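
To make the idea of a wrapper implementing a common search and metadata interface with XML parameters concrete, here is a minimal sketch. It does not reproduce the real SDLIP API: the class names, method names, and XML element names (`query`, `term`, `results`, `doc`, `metadata`) are all assumptions chosen for illustration.

```python
from abc import ABC, abstractmethod
import xml.etree.ElementTree as ET


class SourceWrapper(ABC):
    """Hypothetical stand-in for the two basic interfaces: search and metadata."""

    @abstractmethod
    def search(self, query_xml: str) -> str:
        """Take an XML query and return XML results."""

    @abstractmethod
    def get_metadata(self) -> str:
        """Return XML describing the collection."""


class InMemoryWrapper(SourceWrapper):
    """Wraps a dict of documents so it can be queried like any other source."""

    def __init__(self, name, docs):
        self.name = name
        self.docs = docs  # doc_id -> text

    def search(self, query_xml):
        # Hypothetical request format: <query><term>...</term></query>
        term = ET.fromstring(query_xml).findtext("term", default="").lower()
        results = ET.Element("results")
        for doc_id, text in self.docs.items():
            if term and term in text.lower():
                ET.SubElement(results, "doc", id=doc_id)
        return ET.tostring(results, encoding="unicode")

    def get_metadata(self):
        meta = ET.Element("metadata", name=self.name, numdocs=str(len(self.docs)))
        return ET.tostring(meta, encoding="unicode")


wrapper = InMemoryWrapper("demo", {
    "d1": "Biomedical text mining",
    "d2": "Compiler design",
})
reply = wrapper.search("<query><term>biomedical</term></query>")
matching_ids = [doc.get("id") for doc in ET.fromstring(reply).findall("doc")]
print(matching_ids)
print(wrapper.get_metadata())
```

Because both methods exchange XML strings, a client needs no knowledge of how the wrapped collection stores or searches its documents.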

10
STARTS: Informal Standard for Search Engine
Interoperability
  • Coordinated by Stanford in 1996
  • Both search engine vendors and "users"
    participated:
      • Netscape
      • Microsoft Network
      • GILS
      • Infoseek
      • Harvest
      • Hewlett-Packard
      • Fulcrum
      • Verity
      • WAIS
      • PLS
      • Excite

11
STARTS: A Metasearching Protocol
  • Defines:
      • Query language
      • Results format
      • Metadata for the collection
  • No specified transport layer or implementation
  • Naturally complements SDLIP for metasearching
    purposes

Example of metadata:
  Stemming: no; # of docs: 20,000
  Diabetes → TF = 12, DF = 4
  XML → TF = 1200, DF = 750
12
SDARTS = SDLIP + STARTS
  • Extends SDLIP with a richer metadata interface
    from STARTS
  • Keeps compatibility with SDLIP (same DTDs)
  • Can easily support similar protocols
    (transforming XML is easy)
  • Makes wrapping collections easy through a toolkit

13
SDARTS: Implementation Details
  • Defined STARTS using XML; the new version is named
    STARTS XML.
  • Used getPropertyInfo() from SDLIP to extend
    SDLIP with STARTS metadata.
  • Term frequency information is available through a
    different URL (faster download for metasearchers
    that do not use it).

14
Example of STARTS Metadata: Content Summary

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE starts:scontent-summary SYSTEM
    "http://www.cs.columbia.edu/dli2test/dtd/starts.dtd">
<starts:scontent-summary
    xmlns:starts="http://www.cs.columbia.edu/dli2test/STARTS/"
    version="Starts 1.0"
    stemming="false"
    stopwords="false"
    case-sensitive="true"
    fields="false"
    numdocs="19997">
  <starts:field-freq-info>
    <starts:field type-set="basic1" name="body-of-text"/>
    <starts:term>
      <starts:value>algorithm</starts:value>
    </starts:term>
    <starts:term-freq>75</starts:term-freq>
    <starts:doc-freq>34</starts:doc-freq>
    . . .
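
A metasearcher consuming such a content summary might read it roughly as sketched below, using Python's standard `xml.etree.ElementTree`. The slide's excerpt is truncated, so the closing tags in `SUMMARY_XML` are an assumption, the DOCTYPE line is omitted for brevity, and `parse_content_summary` is a hypothetical helper, not part of the SDARTS toolkit.

```python
import xml.etree.ElementTree as ET

# Namespace URI as it appears on the slide.
NS = {"starts": "http://www.cs.columbia.edu/dli2test/STARTS/"}

# The slide's excerpt, with closing tags added (an assumption) so it parses.
SUMMARY_XML = """<?xml version="1.0" encoding="UTF-8"?>
<starts:scontent-summary
    xmlns:starts="http://www.cs.columbia.edu/dli2test/STARTS/"
    version="Starts 1.0" stemming="false" stopwords="false"
    case-sensitive="true" fields="false" numdocs="19997">
  <starts:field-freq-info>
    <starts:field type-set="basic1" name="body-of-text"/>
    <starts:term>
      <starts:value>algorithm</starts:value>
    </starts:term>
    <starts:term-freq>75</starts:term-freq>
    <starts:doc-freq>34</starts:doc-freq>
  </starts:field-freq-info>
</starts:scontent-summary>
"""


def parse_content_summary(xml_bytes):
    """Return (num_docs, {term: (term_freq, doc_freq)}) for a summary document."""
    root = ET.fromstring(xml_bytes)
    num_docs = int(root.get("numdocs"))
    stats = {}
    for info in root.findall("starts:field-freq-info", NS):
        term = info.findtext("starts:term/starts:value", namespaces=NS)
        term_freq = int(info.findtext("starts:term-freq", namespaces=NS))
        doc_freq = int(info.findtext("starts:doc-freq", namespaces=NS))
        stats[term] = (term_freq, doc_freq)
    return num_docs, stats


num_docs, stats = parse_content_summary(SUMMARY_XML.encode("utf-8"))
term_freq, doc_freq = stats["algorithm"]
# Fraction of documents in the collection that contain the term.
print(num_docs, term_freq, doc_freq, doc_freq / num_docs)
```

The ratio of document frequency to collection size is one simple signal a metasearcher could use when deciding whether a source is worth querying.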

15
SDARTS Wrapper Design Rationale
  • Goal: Isolate the developer from parsing and
    generating STARTS XML requests and responses
  • Goal: Reusability and simplicity
  • SDARTS toolkits and reference implementations:
      • Wrapping local text document collections
      • Wrapping XML collections
      • Wrapping HTTP/CGI interfaces

16
SDARTS Wrapping Architecture
[Diagram labels: Client Program; SDARTS Bean; Existing SDLIP Client; STARTS XML over HTTP/DASL; STARTS XML; SDLIP LSP; FrontEnd LSP (S, M); LSPObjects; BackEndLSP; Native Protocol / Search Engine]
17
SDARTS Wrapper Implementation
  • Standardize on STARTS as the XML protocol for
    SDLIP
  • Create a standard wrapper architecture
  • Front-End:
      • Implements SDLIP interfaces
      • Communicates with client using STARTS XML
        nested inside SDLIP method calls
  • Back-End:
      • Communicates with front-end using simple
        container objects
      • Talks to underlying collection using native
        protocol

[Diagram labels: STARTS XML; FrontEnd LSP (S, M); LSPObjects; BackEnd LSP; Native Protocol / Search Engine]
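
The front-end/back-end split described on this slide can be illustrated with the sketch below, in which the front-end handles all XML and the back-end only sees simple container objects. Everything here is an assumption made for illustration: the `Query` and `Hit` containers, the class names, and the request and response element names are not taken from the SDARTS code.

```python
from dataclasses import dataclass
from typing import Dict, List
import xml.etree.ElementTree as ET


@dataclass
class Query:
    """Simple container passed from the front-end to the back-end."""
    field: str
    term: str


@dataclass
class Hit:
    """Simple container passed from the back-end to the front-end."""
    doc_id: str
    score: float


class ListBackEnd:
    """Back-end: talks to the collection natively and never sees XML."""

    def __init__(self, docs: Dict[str, Dict[str, str]]):
        self.docs = docs  # doc_id -> {field name: text}

    def search(self, query: Query) -> List[Hit]:
        hits = []
        for doc_id, fields in self.docs.items():
            words = fields.get(query.field, "").lower().split()
            count = words.count(query.term.lower())
            if count:
                hits.append(Hit(doc_id, float(count)))
        return sorted(hits, key=lambda hit: hit.score, reverse=True)


class FrontEnd:
    """Front-end: parses the XML request and serializes the XML response."""

    def __init__(self, back_end):
        self.back_end = back_end

    def search(self, request_xml: str) -> str:
        # Hypothetical request format: <query field="...">term</query>
        request = ET.fromstring(request_xml)
        query = Query(field=request.get("field", "body-of-text"),
                      term=(request.text or "").strip())
        response = ET.Element("results")
        for hit in self.back_end.search(query):
            ET.SubElement(response, "hit", id=hit.doc_id, score=str(hit.score))
        return ET.tostring(response, encoding="unicode")


docs = {
    "d1": {"body-of-text": "metasearching hidden web collections"},
    "d2": {"body-of-text": "web crawling and web link structure"},
}
front_end = FrontEnd(ListBackEnd(docs))
reply = front_end.search('<query field="body-of-text">web</query>')
hit_ids = [hit.get("id") for hit in ET.fromstring(reply).findall("hit")]
print(hit_ids)
```

Swapping `ListBackEnd` for another class with the same `search` method would change the underlying collection without touching any XML handling, which is the kind of isolation the wrapper design aims for.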
18
Adding a Local Text Collection
  • Write a standard doc_config.xml file
  • Regular expressions describe where to find
    fields
  • No coding or compilation needed!

[Diagram labels: doc_config.xml; meta_attributes.xml; content_summary.xml; index; TextBackEndLSP; Lucene Search Engine; Non-indexed Text Documents]
19
Sample doc_config.xml

<doc-config re-index="true">
  <path>/home/dli2test/collections/doc1/20groups</path>
  <linkage-prefix>http://localhost/20groups</linkage-prefix>
  . . .
  <stop-words>
    <word>the</word>
    <word>a</word>
  </stop-words>
  . . .
  <field-descriptor name="author">
    <start><regexp>From </regexp></start>
    <end><regexp></regexp></end>
  </field-descriptor>
  . . .
</doc-config>
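
The way start and end regular expressions could locate a field in a plain-text document is sketched below. This is not the toolkit's actual matching logic: `extract_fields` is a hypothetical helper, and because the end expression is empty in the sample above, the sketch assumes a newline as the end delimiter.

```python
import re

# Field descriptors mirroring the sample config. The "end" pattern is an
# assumption: the sample leaves it empty, so a newline is used here.
FIELD_DESCRIPTORS = {
    "author": {"start": r"From ", "end": r"\n"},
}


def extract_fields(text, descriptors):
    """Return {field name: text found between the start and end patterns}."""
    fields = {}
    for name, descriptor in descriptors.items():
        pattern = re.compile(
            "(?:" + descriptor["start"] + ")(.*?)(?:" + descriptor["end"] + ")",
            re.DOTALL,
        )
        match = pattern.search(text)
        if match:
            fields[name] = match.group(1).strip()
    return fields


# Made-up newsgroup-style document.
document = "From alice@example.org\nSubject: hidden web\n\nBody text.\n"
fields = extract_fields(document, FIELD_DESCRIPTORS)
print(fields)
```

Adding another field would only require a new descriptor entry, which matches the slide's point that no coding or compilation is needed to wrap a text collection.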
20
Adding a Local XML Collection
  • Write a standard doc_config.xml file
  • Write an XSL stylesheet to find fields in
    documents
  • No coding or compilation needed!

[Diagram labels: doc_style.xsl; doc_config.xml; meta_attributes.xml; content_summary.xml; index; Apache Xalan XSL Processor; Lucene Search Engine; XMLBackEndLSP; Non-indexed XML Documents]
21
Adding an External Web Collection
  • Must code a custom wrapper to send correct CGI
    parameters and parse the returned HTML
  • No new code needed: uses XSLT for parsing the
    results
  • Usually no metadata or content summary available
  • Possible to automate metadata extraction:
      • Callan et al., SIGMOD'99: Automatic extraction
        of vocabulary statistics
      • Ipeirotis et al., SIGMOD'01: Automatic
        categorization of databases
      • Raghavan and Garcia-Molina, VLDB 2001:
        Automatic interaction with forms

[Diagram labels: meta_attributes.xml; Web BackEnd LSP; HTTP/CGI Collection]
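
The two tasks a web wrapper has to perform, sending the right CGI parameters and parsing the returned HTML, can be sketched as follows. The slide says SDARTS uses XSLT for the parsing step; this sketch swaps in Python's standard `html.parser` instead, purely for illustration. The base URL, the `title` parameter, and the sample HTML are all made up, and no network request is issued.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


def build_query_url(base_url, params):
    """Encode the search parameters the way an HTML search form would."""
    return base_url + "?" + urlencode(params)


class LinkExtractor(HTMLParser):
    """Collects (href, link text) pairs from a result page."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.results.append((self._href, "".join(self._text).strip()))
            self._href = None


def parse_results(html):
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.results


# Hypothetical search form location and a made-up result page.
url = build_query_url("http://example.org/search", {"title": "biomedical"})
SAMPLE_HTML = (
    '<ul><li><a href="/doc/1">Biomedical imaging</a></li>'
    '<li><a href="/doc/2">Biomedical text mining</a></li></ul>'
)
hits = parse_results(SAMPLE_HTML)
print(url)
print(hits)
```

A declarative stylesheet plays the role of `LinkExtractor` in the approach the slide describes, so that supporting a new site means writing a new stylesheet rather than new code.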
22
Conclusions
  • SDARTS uses SDLIP interfaces and code (compatible
    with it).
  • SDARTS enhances SDLIP and STARTS.
  • Reference wrappers available for common
    collection types.
  • Any text or XML document collection can be easily
    wrapped without new compiled code.
  • Automatic metadata extraction for local
    collections
  • Using XSLT for web wrappers
  • Possible to automate the extraction of rich
    metadata for web-accessible collections
  • New wrappers can be written without having to
    parse or generate STARTS XML.
  • SDARTS is in Java and can run on multiple
    platforms.

23
We are on the Web :)
  • Available for downloading:
      • SDARTS DTDs and documentation
      • Java code and search engine (Lucene) included
      • Source code documentation
      • Web client source code
      • Reference wrappers (text, XML, web)
      • Wrapped collections
  • The web client is web-accessible for the public
    to test and query our SDARTS server
  • http://sdarts.cs.columbia.edu/

24
Related Work
  • Metadata
      • Open Archives
      • Dublin Core
      • MARC
  • Interoperability Protocols
      • Z39.50
      • GILS