Title: Search Engine Technology for Digital Libraries
1- Search Engine Technology for Digital Libraries
- State of the Art and Future
- 7th International Bielefeld Conference
- Jürgen Oesterle
- Juergen.oesterle_at_fastsearch.com
- Fast Search & Transfer Deutschland GmbH
2- Most prominent problems with digital libraries
Multiple content sources problem
- In a typical digital library, you have to provide a combined search on many different collections at a time
- The format of the content varies between these collections
- The availability of structure varies between these collections
- The availability of external reference data varies between these collections
- The availability of meta data varies between these collections
- The kind of content might vary between these collections
On these grounds, it is extremely difficult to provide equal ranking among the documents in a result set that come from different content sources.
3- Most prominent problems with digital libraries
Meta data problem
- You can't tell how the meta data was generated (Author? Editor? Automatically assigned?)
- You can't tell in advance what meta data is available (Title, author, keywords, date, publisher, place, etc.)
- You don't know the original purpose of the meta data (Quick summary for reader? Condensed description? Normalization of content for search?)
- You can't assume uniform availability and quality of meta data even within one collection
4- Most prominent problems with digital libraries
Distributed documents problem
- Documents are often really hypertext, i.e. their
parts are distributed over a site, with links
between them
Multiple languages problem
- Documents are often in many different languages
Availability of classification schemas problem
- If classification is of interest (and of help
while searching), the underlying classification
taxonomies are not standardized across collections
5- Most prominent problems with digital libraries
Inaccurate queries problem
- Users typically lack domain-specific knowledge
- Users don't have the proper terminology to hand
- Users don't include all potential synonyms and variations in the query
- Users have a problem but aren't sure how to phrase it (i.e. how the same problem is phrased in the documents)
On these grounds, it is extremely difficult to provide a perfectly relevant result set as the first response. Intelligent suggestions for refinement or expansion are needed.
6- Technologies are underway to solve the problems...
Meta data extraction
- Automatic extraction of keywords
- Structural analysis
- Normalization of existing meta data
- Use external reference data (citation analysis)
Example: "Suffering from chronic rhinitis, the patient was treated ..."
Part of speech tagging and normalization
  Vpart Prep Adj N Det N Vcop Vpart
Extraction of specific syntactic patterns
  chronic rhinitis
Statistical analysis of the extracted patterns
  $\log \frac{P(\text{chronic rhinitis})}{P(\text{chronic})\,P(\text{rhinitis})}$
Identification of new terminology
  chronic rhinitis
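The statistical step above compares how often a candidate phrase occurs with how often its words would co-occur by chance, which corresponds to pointwise mutual information. The following is only a minimal sketch of that idea, not the method used in the product: the function name `pmi_scores`, the adjective-noun pattern and the tiny tagged corpus are assumptions made for illustration.

```python
import math
from collections import Counter

def pmi_scores(tagged_sentences, pattern=("Adj", "N")):
    """Score adjacent word pairs that match a POS pattern by
    log( P(w1 w2) / (P(w1) * P(w2)) )."""
    word_counts = Counter()
    pair_counts = Counter()
    total_words = 0
    total_pairs = 0
    for sentence in tagged_sentences:
        words = [word.lower() for word, _tag in sentence]
        word_counts.update(words)
        total_words += len(words)
        for (w1, t1), (w2, t2) in zip(sentence, sentence[1:]):
            total_pairs += 1  # every adjacent pair counts towards P(w1 w2)
            if (t1, t2) == pattern:
                pair_counts[(w1.lower(), w2.lower())] += 1
    scores = {}
    for (w1, w2), n in pair_counts.items():
        p_pair = n / total_pairs
        p_w1 = word_counts[w1] / total_words
        p_w2 = word_counts[w2] / total_words
        scores[(w1, w2)] = math.log(p_pair / (p_w1 * p_w2))
    return scores

# Hypothetical, already POS-tagged toy corpus (tags follow the slide's notation).
CORPUS = [
    [("Suffering", "Vpart"), ("from", "Prep"), ("chronic", "Adj"), ("rhinitis", "N"),
     ("the", "Det"), ("patient", "N"), ("was", "Vcop"), ("treated", "Vpart")],
    [("Chronic", "Adj"), ("rhinitis", "N"), ("is", "Vcop"), ("common", "Adj")],
    [("the", "Det"), ("new", "Adj"), ("patient", "N"), ("was", "Vcop"), ("treated", "Vpart")],
    [("the", "Det"), ("patient", "N"), ("left", "V")],
]

if __name__ == "__main__":
    for pair, score in sorted(pmi_scores(CORPUS).items(), key=lambda kv: -kv[1]):
        print(" ".join(pair), round(score, 3))
```

Pairs whose words occur together more often than their individual frequencies predict receive higher scores and are the better candidates for new terminology.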
7- Technologies are underway to solve the problems...
Meta data extraction: structural analysis
Example page, with the label assigned to each text block:
- Journal of Cancer Research (journal title)
- Issue 5, 2003 -12
- Investigations in E. coli (article title)
- B. C. Abracadabra, Department of Molecular Medicine, University of Wisconsin (affiliation)
- S. Miheev, Analytical Laboratory, Russian Academy of Sciences, Moscow (affiliation)
- Abstract: In this study we investigate (abstract)
- 1. Introduction
- 2. Materials and Methods
Processing steps:
1. Analyse structure
2. Determine text block features
3. Classify text blocks
4. Apply structure grammar
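The four steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the presenter's implementation: the features (font size, block position, cue words), the hand-written classification rules and the regular expression used as a "structure grammar" are all hypothetical.

```python
import re
from dataclasses import dataclass

@dataclass
class TextBlock:
    text: str
    font_size: float  # hypothetical layout feature, in points

def block_features(block, position):
    """Step 2: derive simple features for one text block."""
    return {
        "position": position,
        "font_size": block.font_size,
        "starts_with_abstract": block.text.lower().startswith("abstract"),
        "is_numbered_heading": re.match(r"\d+\.\s+\S", block.text) is not None,
        "has_affiliation_word": re.search(
            r"\b(Department|University|Laboratory|Academy)\b", block.text) is not None,
    }

def classify_block(features):
    """Step 3: rule-based classification (the rules are illustrative only)."""
    if features["starts_with_abstract"]:
        return "abstract"
    if features["is_numbered_heading"]:
        return "section_heading"
    if features["has_affiliation_word"]:
        return "affiliation"
    if features["position"] == 0:
        return "journal_title"
    if features["font_size"] >= 16:
        return "article_title"
    return "other"

LABEL_CODES = {"journal_title": "J", "other": "O", "article_title": "T",
               "affiliation": "A", "abstract": "B", "section_heading": "H"}
# Step 4: a toy structure grammar over the label sequence:
# journal title, optional other block, article title, affiliations, abstract, headings.
STRUCTURE_GRAMMAR = re.compile(r"JO?TA+BH+")

def analyse(blocks):
    labels = [classify_block(block_features(b, i)) for i, b in enumerate(blocks)]
    codes = "".join(LABEL_CODES[label] for label in labels)
    return labels, STRUCTURE_GRAMMAR.fullmatch(codes) is not None

# Step 1 (splitting the page into blocks) is assumed to have happened already.
PAGE = [
    TextBlock("Journal of Cancer Research", 12),
    TextBlock("Issue 5, 2003 -12", 10),
    TextBlock("Investigations in E. coli", 18),
    TextBlock("B. C. Abracadabra, Department of Molecular Medicine, University of Wisconsin", 10),
    TextBlock("S. Miheev, Analytical Laboratory, Russian Academy of Sciences, Moscow", 10),
    TextBlock("Abstract In this study we investigate", 10),
    TextBlock("1. Introduction", 12),
    TextBlock("2. Materials and Methods", 12),
]

if __name__ == "__main__":
    labels, is_valid = analyse(PAGE)
    print(labels, is_valid)
```

A production system would more likely learn the block classifier from labelled pages; the grammar check then serves to reject label sequences that cannot belong to a well-formed article.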
8- Technologies are underway to solve the problems...
Meta data extraction: use external reference data (citation analysis)
Citation graph (Article 1 to Article 8)
- Infer the relative importance of Article 5 and use the textual context of each citation to obtain good descriptors of it
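Both ideas on this slide can be illustrated with a small sketch. It is an assumption-laden example rather than the actual system: the citation edges, the context snippets and the stopword list are invented, and importance is approximated by a plain count of incoming citations.

```python
from collections import Counter, defaultdict

# Hypothetical citations: (citing article, cited article, text around the citation).
CITATIONS = [
    (1, 5, "a fast spellchecking method for medical queries"),
    (2, 5, "the spellchecking approach of"),
    (3, 5, "query spellchecking as described in"),
    (6, 8, "terminology extraction following"),
    (4, 7, "ranking experiments reported by"),
]
STOPWORDS = {"a", "the", "of", "for", "as", "in", "by",
             "described", "following", "reported"}

def citation_counts(citations):
    """Relative importance, approximated by the number of incoming citations."""
    return Counter(cited for _citing, cited, _context in citations)

def descriptors(citations, top_n=3):
    """Most frequent non-stopword terms in the citation contexts of each article."""
    words = defaultdict(Counter)
    for _citing, cited, context in citations:
        words[cited].update(w for w in context.lower().split() if w not in STOPWORDS)
    return {cited: [w for w, _n in counter.most_common(top_n)]
            for cited, counter in words.items()}

if __name__ == "__main__":
    print(citation_counts(CITATIONS).most_common())
    print(descriptors(CITATIONS))
```

The descriptors come from what other authors write about an article, so they can serve as extra index terms even when the article's own meta data is poor.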
9- Technologies are underway to solve the problems...
Meta data extraction: use external reference data (citation analysis)
Citation graph (Article 1 to Article 8)
- Infer relatedness of Article 8 and Article 7 because they are cited by the same articles
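Relatedness through shared citing articles is commonly called co-citation. A minimal sketch follows, assuming a hypothetical set of citation edges (the original graph's edges are not given here) and counting, for each pair of articles, how many articles cite both.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical edges: citing article -> set of cited articles.
CITES = {
    1: {7, 8},
    2: {7, 8},
    3: {5, 7, 8},
    4: {5},
    6: {5},
}

def cocitation_counts(cites):
    """Count, for every pair of articles, how many articles cite both of them."""
    counts = defaultdict(int)
    for cited in cites.values():
        for a, b in combinations(sorted(cited), 2):
            counts[(a, b)] += 1
    return dict(counts)

if __name__ == "__main__":
    for pair, n in sorted(cocitation_counts(CITES).items(), key=lambda kv: -kv[1]):
        print(pair, n)
```

In this toy graph, Articles 7 and 8 are cited together most often, so they would be treated as the most closely related pair.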
10- Technologies are underway to solve the problems...
Equal ranking
- Test runs with representative queries
- Check typical ranking position per content source
- Assign static rank boosts per content source, based on results
Content source A
- only abstracts
- rich meta data
- no external references
Content source B
- full text documents
- indexed in citation index
- rich meta data
Content source C
- full text documents
- PDF, DOC (conversion problems)
Content source D
- web data
- hard to crawl, distributed documents
- unreliable meta data
- web anchor text as external reference
Content source E
- little meta data
- indexed in citation index
- full text articles
Each content source enters the Retrieval Engine with a static boost: high, medium or low.
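The procedure of measuring typical ranking positions and then applying static boosts per content source can be sketched as follows. The source names `X` and `Y`, the boost values and the multiplicative way of applying a boost are assumptions for illustration; the slides do not state how the engine combines boost and score.

```python
from collections import defaultdict
from statistics import mean

def typical_positions(test_runs):
    """Mean 1-based rank position per content source over several test result lists.

    test_runs: list of result lists, each a list of (source, doc_id) in rank order.
    """
    positions = defaultdict(list)
    for results in test_runs:
        for rank, (source, _doc_id) in enumerate(results, start=1):
            positions[source].append(rank)
    return {source: mean(ranks) for source, ranks in positions.items()}

def apply_boosts(hits, boosts):
    """Multiply each hit's score by its source's static boost and re-sort.

    hits: list of (doc_id, source, score); sources without a boost keep factor 1.0.
    """
    boosted = [(doc_id, source, score * boosts.get(source, 1.0))
               for doc_id, source, score in hits]
    return sorted(boosted, key=lambda hit: hit[2], reverse=True)

if __name__ == "__main__":
    runs = [[("X", "d1"), ("Y", "d2"), ("X", "d3")], [("Y", "d4"), ("X", "d5")]]
    print(typical_positions(runs))
    # Hypothetical boost chosen by hand after inspecting the test runs.
    print(apply_boosts([("d1", "X", 1.0), ("d2", "Y", 0.8)], {"Y": 1.5}))
```

Because the boosts are static, they are set once per content source after the test runs and do not depend on the individual query.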
11- Technologies are underway to solve the problems...
Proper treatment of queries
- Deal with orthographic variation
- Deal with morphological variation
- Deal with vocabulary variation
- Deal with special-interest queries (e.g. restrict to user homepages, find definitions, narrow down to articles)
Example query: Cerebral infarct
Techniques shown: phrasing, spellchecking, character normalization, lemmatization, synonymy, thesaurus support, topic classification, doc type classification, refinement, ambiguous queries
Variants and expansions shown:
- Cerebral infarkt, Serebral infarct, Cetebral ingarct
- Cerebral infarcts
- Infarctus cérébral
- Apoplexy, Apoplectic insult, Stroke
- Cerebral disease, Infarction
- Cerebral infarct / medicine, Cerebral infarct / biology
- Cerebral infarct / conferences
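A few of these query treatments (character normalization, spellchecking, a crude form of lemmatization and synonym expansion) can be sketched with the Python standard library. The vocabulary, the single thesaurus entry and the plural-stripping rule are hypothetical stand-ins; a real engine would use full dictionaries and proper linguistic analysis.

```python
import difflib
import unicodedata

# Hypothetical index vocabulary and thesaurus entry, reusing terms from the slide.
VOCABULARY = ["cerebral", "infarct", "stroke", "apoplexy"]
SYNONYMS = {("cerebral", "infarct"): ["apoplexy", "apoplectic insult", "stroke"]}

def strip_accents(text):
    """Character normalization: decompose characters and drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def correct_token(token):
    """Map a token to a vocabulary term where possible."""
    if token in VOCABULARY:
        return token
    # Crude lemmatization stand-in: strip a plural "s".
    if token.endswith("s") and token[:-1] in VOCABULARY:
        return token[:-1]
    # Spellchecking: closest vocabulary term above a similarity cutoff.
    close = difflib.get_close_matches(token, VOCABULARY, n=1, cutoff=0.8)
    return close[0] if close else token

def normalize_query(query):
    return [correct_token(token) for token in strip_accents(query).lower().split()]

def expand_query(query):
    """Return the normalized query followed by any thesaurus synonyms."""
    tokens = normalize_query(query)
    return [" ".join(tokens)] + SYNONYMS.get(tuple(tokens), [])

if __name__ == "__main__":
    for q in ["Cerebral infarkt", "Serebral infarct", "Cetebral ingarct", "Cerebral infarcts"]:
        print(q, "->", expand_query(q))
```

Normalizing first and expanding afterwards keeps the thesaurus small, since it only needs entries for the normalized form of each query.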
12- Technologies are underway to solve the problems...
Smart data aggregation
While crawling for documents:
- recognize links that point to other parts of the document
- follow these links and put together a complete document.
Example: the article page "Investigations in E. coli" (Journal of Cancer Research, Issue 5, 2003 -12; B. C. Abracadabra, author info; S. Miheev, author info) links to its parts (Abstract, Introduction, Chapter 1 to Chapter 4) alongside other links (Journal contents, Current Issue, This issue, Personal Profile).
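The aggregation step can be sketched as a small crawl over an in-memory site. Everything specific here is an assumption: the URLs, the page texts and the heuristic that a link belongs to the document when its anchor text looks like a part name (Abstract, Introduction, Chapter N).

```python
import re
from collections import deque

# Hypothetical site: URL -> (page text, [(anchor text, target URL), ...]).
SITE = {
    "/article": ("Investigations in E. coli",
                 [("Abstract", "/article/abstract"), ("Chapter 1", "/article/ch1"),
                  ("Journal contents", "/journal"), ("Personal Profile", "/profile")]),
    "/article/abstract": ("In this study we investigate", []),
    "/article/ch1": ("Chapter 1 text", [("Chapter 2", "/article/ch2")]),
    "/article/ch2": ("Chapter 2 text", []),
    "/journal": ("Journal of Cancer Research contents", []),
    "/profile": ("Profile page", []),
}

# Illustrative heuristic: anchors that name a part of the document.
PART_LINK = re.compile(r"(abstract|introduction|chapter \d+)", re.IGNORECASE)

def aggregate(start_url, site):
    """Follow only links that point to other parts of the document and join the texts."""
    parts = []
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        text, links = site[url]
        parts.append(text)
        for anchor, target in links:
            if PART_LINK.fullmatch(anchor) and target in site and target not in seen:
                seen.add(target)
                queue.append(target)
    return "\n".join(parts)

if __name__ == "__main__":
    print(aggregate("/article", SITE))
```

Navigation links such as "Journal contents" are skipped, so the indexed unit is the complete article rather than a set of fragments.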
13- Technologies are underway to solve the problems...
Document side: Crawling, Doc, Document processing, INDEX, Citation index
Query side: Query, Query processing, INDEX, Results, Result processing
Where the technologies apply:
- Smart data aggregation (e.g. restoring distributed documents)
- Advanced linguistic processing (e.g. terminology extraction, classification, structural analysis)
- Proper treatment of queries (e.g. covering morphological and semantic variation)
- Query refinement suggestions (e.g. covering morphological and semantic variation)
14- Scirus
15- Scirus
16- Scirus
17- Evolution of Digital Libraries
Traditional DL
- data base
- pure predefined meta data
- exact match
- data is heterogeneous, not normalized, incomplete, unreliable
Full text search engine
- inverted index
- full text
- exact match
- data is heterogeneous, not normalized, redundant, unreliable
Next generation DL
- inverted index + linguistics
- smart data aggregation
- extracted information
- fuzzy search
- data is homogeneous, auto-normalized, auto-completed, reliable
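The step from exact match to fuzzy search on an inverted index can be illustrated with a toy example. This is a sketch under assumptions: the two documents are invented, and string similarity from the standard library stands in for the linguistic processing a real engine would apply.

```python
import difflib
from collections import defaultdict

def build_index(docs):
    """Inverted index: term -> set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def exact_search(index, term):
    """Exact match: only documents containing the term as typed."""
    return set(index.get(term.lower(), set()))

def fuzzy_search(index, term, cutoff=0.8):
    """Fuzzy match: also accept index terms that are similar to the query term."""
    hits = set()
    for candidate in difflib.get_close_matches(term.lower(), list(index), n=5, cutoff=cutoff):
        hits |= index[candidate]
    return hits

# Hypothetical documents.
DOCS = {1: "cerebral infarct therapy", 2: "stroke prevention"}

if __name__ == "__main__":
    index = build_index(DOCS)
    print(exact_search(index, "infarkt"))
    print(fuzzy_search(index, "infarkt"))
```

With exact matching the misspelled query finds nothing, while the fuzzy lookup still reaches the document about cerebral infarct.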