Title: Lightweight Federation of Interoperable Digital Libraries
1Lightweight Federation of Interoperable Digital
Libraries
- Ph.D. Dissertation Defense
- Rong Shi
- Department of Computer Science
- Old Dominion University
- November, 2004
2Overview
- Introduction: problem statement and motivation
- Background and related work
- Lightweight Federated Digital Library (LFDL) system architecture
- Design and implementation
- Conclusion and future work
- Demo
3Why Digital Libraries
- Proliferation of digital information
- Digital library: a digitized, organized, universally accessible collection of information
- DL vs. WWW/Web search engine
- Search spectrum
- Contents and management
- User interface and service
4Motivation - why DL interoperability
- Digital libraries are important tools used by a large number of people every day
- Search interfaces of digital libraries vary greatly and can lead to confusion
- Redundant work needs to be done when searching multiple digital libraries
- Most digital libraries are unable to interact with each other
- Many digital libraries have proprietary architectures and won't change them to interoperate with other libraries
5Objectives
- Goal of interoperability
- Interoperability among non-cooperating DLs
- Federated search service
- Existing DL systems intact
- Transparent to users when participants are added, deleted, or modified
- Lightweight, dynamic, flexible
- Service quality, usability, performance, and scalability
- Solution
- LFDL: Lightweight Federated Digital Library, or InterOp
6Framework
7Addressed Issues
- The feasibility of interoperability of non-cooperating DLs
- The architecture of building a federation service
- The quality and usability of the federated service
- System performance and robustness
8Background
- General approaches
- Distributed search
- Accurate, fresher results, but
- Joint protocol required; performance, reliability, and scalability issues
- Models and sample solutions
- Fully cooperative federation: NCSTRL
- Protocol exchange: Z39.50, STARTS, SDLIP, SDARTS
- Results gathering: meta web search engines, SearchLight
- Metadata harvesting
- Better scalability and enhanced value-added services, but
- Update overhead and synchronization for data freshness; harvesting protocol required
9Distributed Search: Fully Cooperative Federation (NCSTRL)
- Dienst, Distributed search, UI,
- High cost
- Install and run same software
- Inflexible
- Software update?
- Internal structure change?
- Most suitable for each participant?
- Performance, reliability, scalability
- Only returns after all sites have responded, or a timeout occurs
- Response time: worst case
- Only as good as its weakest link, not scalable
10Distributed Search: Protocol Exchange (SDLIP)
- By Andreas Paepcke et al. at Stanford
- Protocol between client and LSP
- LSP: Library Service Proxy
- Wrapper for information sources
11SDLIP Review
- API or middleware toolkit, not for end users and DLs
- Need to code both the client side and the DL LSP
- Inflexible proxy
- An LSP must be coded for each DL
- Hard-coded access rules and results parsing rules
- Add a DL or change a DL? Change code, recompile, restart
- Not so lightweight
- Too comprehensive and complicated, detailed down to a low level
- Need to install software or follow the API to be fully compliant with SDLIP
- Inefficient interface
- Not really a unified UI; non-simultaneous search
12Distributed Search: Results Gathering (SearchLight)
- California Digital Library initiatives
- Similar to meta web search engine
- Simple search interface
- Search multiple DLs at the same time
- Problem
- Same as meta web search engine
- Simple UI: only keyword search, inaccurate query mapping, irrelevant results
- Non-uniform results display: no results parsing, processing, or merging; only shows returned hits; must search the DL again for a result document
- Inflexible: resource changes?
- Inefficient: performance suffers while waiting for all results to come back
13Metadata Harvesting: OAI
- Metadata harvesting protocol
- Data providers: manage and maintain documents and archives
- Service providers: offer end-user services to access archives
- OAI protocol: OAI-PMH (Protocol for Metadata Harvesting)
- Defines a metadata harvesting framework; de facto standard
- Syntax
- HTTP URL request and XML response
- Validated using XML Schema
- Metadata format: Dublin Core
- OAI review
- For data providers
- Agreement on how to publicize metadata
- Easier to participate, but
- Still need to stick to some convention, as in federation
- For service providers
- Agreement on how to utilize metadata
- Can provide scalable, efficient service, but
- Need to follow convention, as in federation; data synchronization and server availability issues
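The HTTP-request/XML-response syntax of OAI-PMH can be illustrated with a minimal Python sketch. The base URL, identifier, and sample response below are made-up placeholders, and the helper names are hypothetical; only the `verb` and `metadataPrefix` request parameters and the two namespaces come from OAI-PMH 2.0 and Dublin Core.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def build_list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL for a data provider."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def parse_titles(response_xml):
    """Return (identifier, title) pairs found in a ListRecords response."""
    root = ET.fromstring(response_xml)
    pairs = []
    for record in root.iter(f"{{{OAI_NS}}}record"):
        identifier = record.findtext(f"{{{OAI_NS}}}header/{{{OAI_NS}}}identifier")
        title = record.findtext(f".//{{{DC_NS}}}title")
        pairs.append((identifier, title))
    return pairs


# Hypothetical response from a hypothetical data provider.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A sample record</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""

request_url = build_list_records_url("http://example.org/oai")
records = parse_titles(SAMPLE_RESPONSE)
```

A service provider would issue the request over HTTP and store the parsed Dublin Core fields locally; the sketch parses an embedded string so that it runs without network access.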
14Summary of Current Approaches
15LFDL Introduction
- General principle
- Distributed search: results gathering
- Lightweight: no work for data providers to join
- Basic solution
- Dynamic DL specification registration
- Dynamic universal interface
- Dynamic Query mapping
- Efficient system management
- Local repository
16LFDL Basic Approach
17LFDL Architecture
18Data-centered, specification-based LFDL Federation
19LFDL Design: DL specification
- DLDL in XML
- Structure
- General info on a digital library
- Search URL
- Search method
- Query Mapping rules
- Access methods of the digital library
- Search interface definition
- Mapped to LFDL universal interface
- Results retrieval and parsing rules
- Information to be retrieved from the digital library
- Specification DTD
20DL Specification - sample
- Specification for NEEDS
- Access information
- <SEARCHDATA Title="Search Info">
- <SEARCH-METHOD Title="Search Method">POST</SEARCH-METHOD>
- <SEARCH-URL Title="Search URL">http://www.needs.org/needs/public/search/search_results/index.jhtml?_DARGS=/needs/public/search/index_body.jhtml</SEARCH-URL>
- Search interface
- <FORMFIELD>
- <INPUTNAME>
- <INPUTNAME_VALUE Title="Internal Form Name">/smete/forms/FindLearningObjects.keyword</INPUTNAME_VALUE>
- <INPUTNAME_MAPPING Title="Mapped UI Field Name">UI_keyword</INPUTNAME_MAPPING>
- </INPUTNAME>
- <INPUTTYPE Title="Form Type">text input</INPUTTYPE>
- <INPUTVALUE/>
- </FORMFIELD>
- Simulated search interface from specification
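How a federation engine might turn such a specification into a native request can be sketched in Python. This is an illustration under assumptions, not the LFDL implementation: the nesting of FORMFIELD inside SEARCHDATA, the placeholder search URL, and the function name `map_query` are all hypothetical; the element names are taken from the sample.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened specification using the element names of the sample.
SPEC = """
<SEARCHDATA Title="Search Info">
  <SEARCH-METHOD Title="Search Method">POST</SEARCH-METHOD>
  <SEARCH-URL Title="Search URL">http://dl.example.org/search</SEARCH-URL>
  <FORMFIELD>
    <INPUTNAME>
      <INPUTNAME_VALUE Title="Internal Form Name">/smete/forms/FindLearningObjects.keyword</INPUTNAME_VALUE>
      <INPUTNAME_MAPPING Title="Mapped UI Field Name">UI_keyword</INPUTNAME_MAPPING>
    </INPUTNAME>
    <INPUTTYPE Title="Form Type">text input</INPUTTYPE>
    <INPUTVALUE/>
  </FORMFIELD>
</SEARCHDATA>
"""


def map_query(spec_xml, ui_query):
    """Translate universal-interface fields into the DL's native form fields."""
    root = ET.fromstring(spec_xml)
    method = root.findtext("SEARCH-METHOD").strip()
    url = root.findtext("SEARCH-URL").strip()
    params = {}
    for field in root.iter("FORMFIELD"):
        ui_name = field.findtext("INPUTNAME/INPUTNAME_MAPPING")
        native_name = field.findtext("INPUTNAME/INPUTNAME_VALUE")
        # Only fields the user actually filled in are sent to the DL.
        if ui_name in ui_query:
            params[native_name] = ui_query[ui_name]
    return method, url, params


method, url, params = map_query(SPEC, {"UI_keyword": "bridge", "UI_author": "n/a"})
```

Because the mapping lives in data rather than code, adding or changing a DL only means editing its specification; `UI_author` is silently dropped here because this DL declares no field mapped to it.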
21DL Specification sample cont.
- Results matching
- <OVAR-TAG Title="Output Tag">A</OVAR-TAG>
- <OVAR-MATCH Title="Output Match">needs/public/search/search_results/learning_resource/summary</OVAR-MATCH>
- <COMMENT-MATCH-START Title="Comment match start">, </COMMENT-MATCH-START>
- <COMMENT-MATCH-END Title="Comment match end">/p></COMMENT-MATCH-END>
- Multiple results page
- <MULTIPAGE Title="Multi Page Information">
- <MULTI-PAGE Title="MultiPage">yes</MULTI-PAGE>
- <HAS-NEXT Title="Contains Next Link">no</HAS-NEXT>
- <NEXT-URL Title="Matching String">null</NEXT-URL>
- <LINK-URL Title="Matching String">/needs/public/search/search_results/index.jhtml?queryId</LINK-URL>
- <URL-ADDITIONAL-MATCH>page</URL-ADDITIONAL-MATCH>
- <PAGE-HIT Title="No. of hits per page">10</PAGE-HIT>
- </MULTIPAGE>
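One plausible reading of the OVAR-TAG and OVAR-MATCH rules is "collect every `A` tag whose link contains the match string". The Python sketch below implements that reading only; the sample results page, the class name, and the function name are hypothetical, and the real engine may interpret the rules differently.

```python
from html.parser import HTMLParser


class ResultLinkParser(HTMLParser):
    """Collect href values of a given tag when they contain a match string."""

    def __init__(self, tag, match):
        super().__init__()
        self.tag = tag.lower()  # HTMLParser reports tag names in lower case
        self.match = match
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        href = dict(attrs).get("href") or ""
        if self.match in href:
            self.links.append(href)


def extract_result_links(page_html, ovar_tag, ovar_match):
    parser = ResultLinkParser(ovar_tag, ovar_match)
    parser.feed(page_html)
    return parser.links


# Hypothetical results page: two result links and one unrelated help link.
PAGE = """
<p><a href="/needs/public/search/search_results/learning_resource/summary/index.jhtml?id=1">Statics tutorial</a>, a comment</p>
<p><a href="/needs/public/help.jhtml">Help</a></p>
<p><A HREF="/needs/public/search/search_results/learning_resource/summary/index.jhtml?id=2">Beam lab</A></p>
"""

links = extract_result_links(
    PAGE, "A", "needs/public/search/search_results/learning_resource/summary"
)
```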
22Query Mapping Samples
23Specification Issues
- Interface capture and query mapping
- Semantics mapping
- Non-web form based proprietary interface
- Java applet
- Multimedia
- Search behavior simulation and specification
- Access control
- Multi-step search
24LFDL Prototype Universal Search Interface
25LFDL Search Service
26LFDL Search Service: User-centered Dynamic Search
- Keyword-driven dynamic interface
- Based on the keyword-hit set of each DL
- Generating keyword-hit set for each DL
- Source of base keyword set from OAI test-bed
- Data from Arc metadata database
- Data from user search logs
- Generate keyword-hit set
- Calculate hits from Arc metadata database
- Query the remote DL based on base keywords, parse results
- DL specification: add parsing rules
- <DOCHIT Title="Doc hits match string">
- <MATCHSTRING Title="Output Match">total results</MATCHSTRING>
- <BEFORESTRING Title="before string">of</BEFORESTRING>
- <AFTERSTRING Title="after string">total results</AFTERSTRING>
- </DOCHIT>
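The DOCHIT rule can be read as "find the line containing MATCHSTRING, then take the number between BEFORESTRING and AFTERSTRING". A minimal Python sketch of that reading follows; the function name and the sample page text are hypothetical, and a production parser would need to guard against `of` appearing inside another word.

```python
import re


def parse_hit_count(page_text, match_string, before_string, after_string):
    """Return the hit count announced on a results page, or None if absent."""
    for line in page_text.splitlines():
        if match_string not in line:
            continue
        start = line.find(before_string)
        if start < 0:
            continue
        start += len(before_string)
        end = line.find(after_string, start)
        if end < 0:
            continue
        # Drop thousands separators and whitespace around the number.
        digits = re.sub(r"[^\d]", "", line[start:end])
        if digits:
            return int(digits)
    return None


# Hypothetical fragments of a remote DL's results page.
hits = parse_hit_count(
    "Search summary\nResults 1 - 10 of 1,234 total results\n",
    "total results", "of", "total results",
)
missing = parse_hit_count("No documents found\n", "total results", "of", "total results")
```

Running this once per base keyword against each DL would populate that DL's keyword-hit set.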
27Keyword-hit - from Arc DB and remote DL
28Populating Keyword-hit for a DL
29Dynamic Interface - Design
- Interface generation
- Generic universal interface
- Based on Dublin Core
- Complete DL specification in DLDL
- Filter field type, name, values
- Mapping with UI field
- Allow DL unique features (no mapping)
30Interface Generation Algorithm
- Factors
- Input keyword
- Generic base UI
- DL keyword-hit
- DL specification
- Algorithm
- Weight-based DL feature selection
- DL weight determined by keyword-hit
- Absolute feature weight within the UI
- Relative feature weight within a DL
- User behavior from the user feature-selection log
- Algorithm
- balance all weights
- select features with highest weight
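The slide names the weights but not how they are balanced, so the Python sketch below picks one simple combination purely for illustration: each DL is weighted by its share of keyword hits, feature scores are summed across DLs, scaled by the absolute UI weight, and nudged by the share of past user selections. The formula, function name, and all sample numbers are assumptions, not the dissertation's algorithm.

```python
def select_features(keyword_hits, relative_weights, absolute_weights,
                    usage_counts, top_k):
    """Rank UI features for one keyword and return the top_k names."""
    total_hits = sum(keyword_hits.values())
    scores = {}
    for dl, features in relative_weights.items():
        # DL weight: this DL's share of all hits for the input keyword.
        dl_weight = keyword_hits.get(dl, 0) / total_hits if total_hits else 0.0
        for feature, relative in features.items():
            scores[feature] = scores.get(feature, 0.0) + dl_weight * relative
    total_usage = sum(usage_counts.values())
    for feature in scores:
        scores[feature] *= absolute_weights.get(feature, 1.0)
        if total_usage:
            scores[feature] += usage_counts.get(feature, 0) / total_usage
    # Highest score first; ties broken by name for a stable order.
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    return ranked[:top_k]


# Hypothetical DLs and features.
RELATIVE = {
    "DL_A": {"UI_keyword": 1.0, "UI_author": 0.5},
    "DL_B": {"UI_keyword": 1.0, "UI_date": 0.8},
}
ABSOLUTE = {"UI_keyword": 1.0, "UI_author": 1.0, "UI_date": 1.0}

mostly_a = select_features({"DL_A": 90, "DL_B": 10}, RELATIVE, ABSOLUTE, {}, 2)
mostly_b = select_features({"DL_A": 10, "DL_B": 90}, RELATIVE, ABSOLUTE, {}, 2)
```

The two calls show the intended effect: when most hits for a keyword come from DL_A its author field surfaces in the interface, and when they come from DL_B its date field does.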
31Universal Search Interface Based on DC Element
32Enhanced DLDL and Specification
- <FORMFIELD>
- <REQUIRED Title="Required Field or not">Y</REQUIRED>
- <WEIGHT Title="Weight of Field">1</WEIGHT>
- <TYPE Title="Search Criteria or Display Option">Search Criteria</TYPE>
- <LABEL Title="Displayed Field Name">Keywords</LABEL>
- <LENGTH Title="Field Length">35</LENGTH>
- <INPUTNAME>
- <INPUTNAME_VALUE Title="Internal Form Name">keywords</INPUTNAME_VALUE>
- <INPUTNAME_MAPPING Title="Mapped UI Field Name">UI_keyword</INPUTNAME_MAPPING>
- </INPUTNAME>
- <INPUTTYPE Title="Form Type">text input</INPUTTYPE>
- <INPUTVALUE />
- </FORMFIELD>
33Experimentation and Implementation: Interface for keyword "html"
34Experimentation and Implementation: Interface for keyword "university"
35Experimentation and Implementation: Interface customization
36Results Presentation Service: Automatic Metadata Extraction
- Metadata is key
- Service usability: present rich, interactive, and dynamic search results consistently
- Performance and robustness: local repository and intelligent cache
- Available metadata sources
- List page of search results
- Detail page of a selected document/record
- Metadata retrieval approach
- Define a specification of how metadata are presented in those pages
- Use Dublin Core as the common metadata mapping set
- Develop metadata parser to extract metadata
- Store parsed metadata in local repository
- Build up metadata repository
- Proactive
- Passive or piggyback
37Metadata Extraction Approach
39Metadata Retrieval and Parsing Workflow
40Metadata Parsing Rules Definition
- Extended DLDL
- Two levels: list page and record page
- String parsing: separate the raw string into segments corresponding to metadata fields
41Part of DTD for DL parsing rules specification
- <!ELEMENT RESULT-METADATA (MATCH-START,MATCH-END,EXCLUDE,REPLACE,DELIMITER,METADATA-FIELD)>
- <!ELEMENT RECORD-METADATA (MATCH-START?,MATCH-END?,EXCLUDE,REPLACE,DELIMITER,METADATA-FIELD)>
- <!ELEMENT METADATA-FIELD (#PCDATA)>
- <!ATTLIST METADATA-FIELD Title CDATA "information about a particular metadata field">
- <!ATTLIST METADATA-FIELD order CDATA #IMPLIED>
- <!ATTLIST METADATA-FIELD multiple (true | false) #IMPLIED>
- <!ATTLIST METADATA-FIELD delimeter CDATA #IMPLIED>
- <!ATTLIST METADATA-FIELD format CDATA #IMPLIED>
- <!ATTLIST METADATA-FIELD null_value_string CDATA #IMPLIED>
42Sample Specification for CogPrints
- <RESULT-METADATA hasRecordLevel="true">
- <MATCH-START>null</MATCH-START>
- <MATCH-END>null</MATCH-END>
- </RESULT-METADATA>
- <RECORD-METADATA>
- <MATCH-START>name="DC.title"</MATCH-START>
- <MATCH-END isLastIndex="true">" name="DC.creator"</MATCH-END>
- <EXCLUDE>/>&lt;meta content="</EXCLUDE>
- <REPLACE>
- <OLD-STRING>" name="DC.creator"</OLD-STRING>
- <NEW-STRING> </NEW-STRING>
- </REPLACE>
- <METADATA-FIELD order="1" multiple="true" delimeter=" ">CREATOR</METADATA-FIELD>
- </RECORD-METADATA>
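The record-level rules can be read as a small string-processing pipeline: cut the page between MATCH-START and MATCH-END, drop every EXCLUDE string, apply the REPLACE pair, then split on the field delimiter. The Python sketch below follows that reading under stated assumptions: the function name and sample page are hypothetical, and a semicolon is used as the replacement string and delimiter because the values printed in the sample specification appear blank.

```python
def extract_field(page, match_start, match_end, end_is_last,
                  exclude, old_string, new_string, delimiter):
    """Apply record-level parsing rules and return the field's values."""
    start = page.find(match_start)
    if start < 0:
        return []
    start += len(match_start)
    # isLastIndex="true" is read here as "use the last occurrence".
    end = page.rfind(match_end) if end_is_last else page.find(match_end, start)
    if end < start:
        return []
    segment = page[start:end]
    segment = segment.replace(exclude, "")
    segment = segment.replace(old_string, new_string)
    return [value.strip() for value in segment.split(delimiter) if value.strip()]


# Hypothetical record page with Dublin Core meta tags.
RECORD_PAGE = (
    '<meta content="A Sample Title" name="DC.title"/>'
    '<meta content="Smith, J." name="DC.creator"/>'
    '<meta content="Doe, A." name="DC.creator"/>'
    '<meta content="Preprint" name="DC.type"/>'
)

creators = extract_field(
    RECORD_PAGE,
    match_start='name="DC.title"',
    match_end='" name="DC.creator"',
    end_is_last=True,
    exclude='/><meta content="',
    old_string='" name="DC.creator"',
    new_string=";",  # assumed value
    delimiter=";",   # assumed value
)
```

Because the rules are plain strings, a change in a DL's page layout is handled by editing its specification rather than the parser.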
43Results
44Results Merging and Presentation
- Group results based on metadata field
- Can further tailor the interface and format results using XSLT
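Grouping merged results by a metadata field might look like the following Python sketch. The record layout, the `source_dl` key, the `"unknown"` bucket, and the function name are assumptions made for illustration.

```python
from collections import defaultdict


def merge_and_group(results_by_dl, field):
    """Merge per-DL result lists and group the records by one metadata field."""
    groups = defaultdict(list)
    for dl_name, records in results_by_dl.items():
        for record in records:
            # Remember which DL each record came from.
            merged = dict(record, source_dl=dl_name)
            groups[record.get(field, "unknown")].append(merged)
    return dict(groups)


# Hypothetical parsed results from two DLs.
RESULTS = {
    "DL_A": [{"title": "X", "date": "2002"}],
    "DL_B": [{"title": "Y", "date": "2002"}, {"title": "Z"}],
}

by_date = merge_and_group(RESULTS, "date")
```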
45Performance Improvement: Intelligent Cache
- Search scenario
- Case 1: a query for keyword=computer
- Case 2: a query for keyword=computer AND date=2002
- Results: LFDL prototype caching
- Cache grouped by query string, so
- Case 1: no cache hits, distributed search request sent to DLs
- Case 2: no cache hits, distributed search request sent to DLs
- Intelligent cache: enhanced LFDL caching
- Cache grouped by metadata, so
- Case 1: no cache hits, distributed search request sent to DLs
- Case 2: cache hits, search served locally
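The difference between the two caching schemes in the scenario above can be sketched in Python. Class names, the record layout, and the matching rule (keyword in title, exact date) are assumptions for illustration; the point is only that records cached for Case 1 can answer Case 2 when the cache is organized by metadata.

```python
def matches(record, query):
    """Check a cached metadata record against a fielded query."""
    keyword = query.get("keyword")
    if keyword and keyword.lower() not in record["title"].lower():
        return False
    date = query.get("date")
    if date and record.get("date") != date:
        return False
    return True


class QueryStringCache:
    """Prototype-style cache: one entry per exact query."""

    def __init__(self):
        self._by_query = {}

    def store(self, query, records):
        self._by_query[tuple(sorted(query.items()))] = list(records)

    def lookup(self, query):
        return self._by_query.get(tuple(sorted(query.items())))


class MetadataCache:
    """Enhanced-style cache: records kept by identifier, filtered per query."""

    def __init__(self):
        self._records = {}

    def store(self, records):
        for record in records:
            self._records[record["identifier"]] = record

    def lookup(self, query):
        hits = [r for r in self._records.values() if matches(r, query)]
        return hits or None


# Hypothetical results returned by the DLs for Case 1.
REMOTE = [
    {"identifier": "a", "title": "Computer networks", "date": "2002"},
    {"identifier": "b", "title": "Computer graphics", "date": "1999"},
]
CASE_1 = {"keyword": "computer"}
CASE_2 = {"keyword": "computer", "date": "2002"}

query_cache, metadata_cache = QueryStringCache(), MetadataCache()
case_1_misses = (query_cache.lookup(CASE_1), metadata_cache.lookup(CASE_1))
query_cache.store(CASE_1, REMOTE)
metadata_cache.store(REMOTE)
case_2_by_query = query_cache.lookup(CASE_2)
case_2_by_metadata = metadata_cache.lookup(CASE_2)
```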
46Local Metadata Repository
- All searches are served locally first
- A secondary in-memory metadata cache for better performance and system reliability
- Cache grouped by metadata instead of query string
- Cache-based distributed search
- Display results from the cache and, at the same time,
- Still send the query out to DLs to update the cache
- Transparent to end users
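Serving from the cache while refreshing it in the background could be sketched as follows in Python. The class name, the `remote_search` callable, and the title-based matching are hypothetical; a thread pool stands in for whatever concurrency mechanism the real system uses.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class CacheFirstSearch:
    """Answer from the local cache at once and refresh it in the background."""

    def __init__(self, remote_search, executor):
        self._remote_search = remote_search  # callable: keyword -> list of records
        self._executor = executor
        self._cache = {}
        self._lock = threading.Lock()

    def _refresh(self, keyword):
        records = self._remote_search(keyword)
        with self._lock:
            for record in records:
                self._cache[record["identifier"]] = record
        return len(records)

    def search(self, keyword):
        with self._lock:
            cached = [r for r in self._cache.values()
                      if keyword.lower() in r["title"].lower()]
        # The distributed query still goes out so the cache stays fresh.
        refresh = self._executor.submit(self._refresh, keyword)
        return cached, refresh


def fake_remote_search(keyword):
    """Stand-in for the distributed search across registered DLs."""
    return [{"identifier": "a", "title": "Computer networks"}]


with ThreadPoolExecutor(max_workers=2) as pool:
    service = CacheFirstSearch(fake_remote_search, pool)
    first_results, first_refresh = service.search("computer")
    refreshed_count = first_refresh.result()  # wait only to make the demo deterministic
    second_results, second_refresh = service.search("computer")
    second_refresh.result()
```

The first search finds an empty cache; the second is answered locally from the records the first refresh brought back, which is the behavior that stays transparent to end users.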
47Metadata Cache and Repository
48Cache Replacement Algorithm
- Replacement algorithm: least used plus least recently used metadata
- Initial system-wide parameters: cache size, cache keep-safe size
- Runtime parameters per metadata record: date_last_used, total_usage
- Algorithm implementation
- At first start, load from the database ordered by date_last_used and total_usage, and pick records up to the cache size
- String orderBy = " ORDER BY total_usage desc, date_last_used desc";
- String selectMetadata = "SELECT internalID, identifier, archive, datestamp, title, creator, subject, description, publisher, publication, keyword, category, contributor, type, format, source, language, status, date_last_used, total_usage FROM dc" + orderBy;
- Each time a user views a metadata record, update date_last_used and total_usage
- If the cache is full, remove the least used record from the cache and save it to the database (first sort by date_last_used and keep the safe set, then sort by total_usage)
- Cache size and keep-safe size can be changed at runtime
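The replacement rule can be sketched in Python under one interpretation of the bullets above: the most recently used records up to the keep-safe size are protected, and the least used of the remaining records is evicted and written back. The class and attribute names, the integer logical clock, and the list standing in for the database are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    record_id: str
    date_last_used: int  # logical clock tick instead of a real timestamp
    total_usage: int = 0


class UsageRecencyCache:
    def __init__(self, cache_size, keep_safe_size):
        self.cache_size = cache_size          # may be changed at runtime
        self.keep_safe_size = keep_safe_size  # may be changed at runtime
        self.entries = {}
        self.saved_to_db = []  # stands in for the metadata database
        self._clock = 0

    def view(self, record_id):
        """Record that a user viewed a metadata record."""
        self._clock += 1
        entry = self.entries.get(record_id)
        if entry is None:
            if len(self.entries) >= self.cache_size:
                self._evict_one()
            entry = Entry(record_id, self._clock)
            self.entries[record_id] = entry
        entry.date_last_used = self._clock
        entry.total_usage += 1

    def _evict_one(self):
        # Sort by recency and protect the keep-safe set.
        by_recency = sorted(self.entries.values(),
                            key=lambda e: e.date_last_used, reverse=True)
        candidates = by_recency[self.keep_safe_size:] or by_recency
        # Among the rest, evict the least used (oldest first on ties).
        victim = min(candidates, key=lambda e: (e.total_usage, e.date_last_used))
        del self.entries[victim.record_id]
        self.saved_to_db.append(victim.record_id)


cache = UsageRecencyCache(cache_size=3, keep_safe_size=1)
for record_id in ["a", "a", "a", "b", "c", "d"]:
    cache.view(record_id)
```

In this run `b` and `c` have both been used once, but `c` is the most recent record and is kept safe, so `b` is evicted when `d` arrives; the heavily used `a` survives despite being the oldest.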
49Registration and Management service
50Registration and Management
- Registration service
- Validate, add, update, and remove a DL's specification
- Implementation
- LDAP-based
- Tightly-integrated
- Management service
- Real-time system monitoring
- Registered DL
- Average system response time
- Resource usage
- Search activities
- Real-time system reconfiguration and maintenance
- Turn on/off debug mode
- Reallocate system resources
- Update keyword-hits database
51Major Contributions
- Scope
- Provide service for non-cooperating DLs
- DLs like IEEE and ACM may continue to work independently without participating in any interoperation effort
- Automatic metadata extraction from non-cooperating DLs
- Architecture
- Lightweight, dynamic, data-centered, rule-driven architecture
- DL specification defines interoperability processing rules using DLDL, a human-readable and highly maintainable XML-based language
- Dynamic DL registration, removal, or modification
- Powerful federation engine enforces the rules defined in the specification and enables DLs to join the federation in real time: no code change or system restart
- Quickly form a gathering service for a special community: just add the specification of a DL and that DL is incorporated into the service on the fly
- Lots of work behind the scenes: come up with a new language, then develop a processing engine for specifications written in that language
- Can be used in other domains, such as web robots, shopping agents, and price comparison agents
52Major Contributions
- Approach
- Lies between distributed search with no caching (Dienst) and harvesting (Arc)
- So it can achieve the advantages of both: data freshness from distributed search, and rich service, reliability, and performance from harvesting
- Service design
- Service quality and usability
- Dynamic, user-centered, keyword-driven search interface
- Can be applied to other DL applications, like Archon, to design a flexible interface based on archive and metadata
- Results processing and presentation for a rich, user-friendly service
- Parse results so that they can be displayed uniformly without showing each DL's native results page
- Automatic metadata parsing and retrieval can be used by other domains and applications, such as metadata extraction from PDF files
- System performance, efficiency and robustness
- Local metadata repository and intelligent cache
53Publications
- R. Shi, K. Maly and M. Zubair. Interoperable Federated Digital Library using XML and LDAP. Global Digital Library Development in the New Millennium, pages 277-286, May 2001.
- R. Shi, K. Maly and M. Zubair. Dynamic Interoperation of Non-cooperating Digital Libraries. In Proceedings of International Conference on Digital Library - IT Opportunities and Challenges in the New Millennium, Beijing, China, July 2002.
- R. Shi, K. Maly and M. Zubair. Automatic Metadata Discovery from Non-cooperative Digital Libraries. In Proceedings of IADIS International Conference on e-Society 2003, Lisbon, Portugal, June 2003.
- R. Shi, K. Maly and M. Zubair. Improving Federated Service for Non-cooperating Digital Libraries. In Proceedings of International Conference on Digital Libraries, New Delhi, India, February 2004.
- M. Zubair, K. Maly, and R. Shi. Focus Research Libraries in Support of Active Learning. In Proceedings of International Conference on Information and Communication Technologies for Education, Vienna, December 2000.
54Major Issues
- Scalability
- Not easy to incorporate a large number of DLs at one time
- Cache size limits caching
- Resource intensive when serving queries with a large number of results to process
- However, it is still useful for building services for special communities with a limited number of DLs
- Incorporating DLs with complex or non-web-based search interfaces
- DL specification generation is not automatic
- DL behavior discovery: human intervention is still needed to find out whether a DL has changed its searching and presentation mechanisms
- Effective evaluation, measurement, usefulness testing
55Conclusion and Future Work
- Federation service for non-cooperating DLs is possible
- A dynamic, user-centered interface is a practical way to improve quality of service
- Locally harvested metadata improves service usability and performance
- Future work
- Complex interface mapping, access control
- Scalability and performance
- Automatic specification generation
- DL behavior change discovery
- Dynamic interface: keyword relevance instead of hits (only if the user selects that DL or the DL has relevant results)
- Personalized portal: customized interface and results display, most often used searches and remembered search preferences, caching options for fresh data or fast results
56Demo
- Steps: http://www.cs.odu.edu/shi/interop/demo/doc/dissertation/demo/demo_steps.html
- Site: http://128.82.7.86:8088/interop/demo/index.html