Search Engine Technology for Digital Libraries

Search Engine Technology for Digital Libraries: State of the Art and Future. 7th International Bielefeld Conference. Jürgen Oesterle, Juergen.oesterle_at_fastsearch.com

Transcript and Presenter's Notes

1

Search Engine Technology for Digital Libraries
State of the Art and Future
7th International Bielefeld Conference
Jürgen Oesterle
Juergen.oesterle_at_fastsearch.com
Fast Search Transfer Deutschland GmbH

2
Most prominent problems with digital libraries
Multiple content sources problem

In a typical digital library, you have to
provide a combined search on many different
collections at a time
The format of the content varies between these
collections
The availability of structure varies between
these collections
The availability of external reference data
varies between these collections
The availability of meta data varies between
these collections
The kind of content might vary between these
collections

On these grounds, it's extremely difficult to
provide equal ranking among the documents in a
result set that come from different content
sources.
3
Most prominent problems with digital libraries
Meta data problem

You can't tell how the meta data was generated
(Author? Editor? Automatically assigned?)
You can't tell in advance what meta data is
available
(Title, author, keywords, date, publisher,
place, etc.)
You don't know the original purpose of the meta
data
(Quick summary for reader? Condensed
description? Normalization of content for
search?)
You can't assume uniform availability and
quality of meta data even within one collection

4
Most prominent problems with digital libraries
Distributed documents problem

Documents are often really hypertext, i.e. their
parts are distributed over a site, with links
between them

Multiple languages problem

Documents are often in many different languages

Availability of classification schemas problem

If classification is of interest (and of help
while searching), the underlying classification
taxonomies are not standardized across collections

5
Most prominent problems with digital libraries
Inaccurate queries problem

Users typically lack domain-specific knowledge
Users don't have the proper terminology to hand
Users don't include all potential synonyms and
variations in the query
Users have a problem but aren't sure how to
phrase it (i.e. how the same problem is phrased
in the documents)

On these grounds, it's extremely difficult to
provide a perfectly relevant result set as the first
response. Intelligent suggestions for refinement
or expansion are needed.
6
Technologies are underway to solve the problems..
Meta data extraction

Automatic extraction of keywords
Structural analysis
Normalization of existing meta data
Use external reference data (citation analysis)

Example sentence: "Suffering from chronic rhinitis, the patient was treated ..."
Part of speech tagging and normalization:
Vpart Prep Adj N Det N Vcop Vpart
Extraction of specific syntactic patterns:
chronic rhinitis
Statistical analysis of the extracted patterns:
$\log \frac{P(\text{chronic rhinitis})}{P(\text{chronic})\, P(\text{rhinitis})}$
Identification of new terminology:
chronic rhinitis
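
The statistical step on this slide scores a candidate term by the log ratio of the pair's probability to the product of the single-word probabilities. A minimal sketch of that score, assuming plain whitespace tokenization and maximum-likelihood estimates from a toy corpus (the slide does not say how the probabilities are estimated, and the part-of-speech filtering step is omitted here; `pmi_scores` and `toy_corpus` are hypothetical names):

```python
import math
from collections import Counter


def pmi_scores(sentences):
    """Score adjacent word pairs by log(P(w1 w2) / (P(w1) * P(w2)))."""
    unigrams = Counter()
    bigrams = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bi          # relative frequency of the pair
        p_w1 = unigrams[w1] / n_uni    # relative frequency of each word
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log(p_pair / (p_w1 * p_w2))
    return scores


# Toy corpus, invented for illustration only.
toy_corpus = [
    "the patient has chronic rhinitis",
    "chronic rhinitis was treated",
    "the patient was treated",
]
scores = pmi_scores(toy_corpus)
for pair, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(pair, round(score, 3))
```

Pairs that occur together more often than their individual frequencies would predict get a higher score, which is what makes them candidates for new terminology.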
7
Technologies are underway to solve the problems..
Meta data extraction

Automatic extraction of keywords
Structural analysis
Normalization of existing meta data
Use external reference data (citation analysis)

Processing steps:
1. Analyse structure
2. Determine text block features
3. Classify text blocks
4. Apply structure grammar

Example page with text block labels (figure):
Journal title: Journal of Cancer Research
Issue 5, 2003 -12
Article title: Investigations in E. coli
Affiliation: B. C. Abracadabra, Department of Molecular Medicine, University of Wisconsin; S. Miheev, Analytical Laboratory, Russian Academy of Sciences, Moscow
Abstract: In this study we investigate
Section headings: 1. Introduction; 2. Materials and Methods
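
Steps 2 and 3 (determining text block features and classifying blocks) could look roughly like the sketch below. The feature set and the hand-written rules are assumptions made for illustration; the slide does not describe the actual features or classifier, and `block_features` and `classify_block` are hypothetical names.

```python
import re


def block_features(text, position):
    """Compute a few simple features for one text block of a page."""
    return {
        "position": position,  # index of the block on the page
        "n_words": len(text.split()),
        "numbered": bool(re.match(r"\d+\.\s", text)),
        "has_institution": bool(
            re.search(r"Department|University|Academy|Laboratory", text)
        ),
        "starts_abstract": text.lower().startswith("abstract"),
    }


def classify_block(features):
    """Map features to a block label with hand-written rules."""
    if features["position"] == 0:
        return "journal title"
    if features["starts_abstract"]:
        return "abstract"
    if features["numbered"]:
        return "section heading"
    if features["has_institution"]:
        return "affiliation"
    return "other"


blocks = [
    "Journal of Cancer Research",
    "B. C. Abracadabra Department of Molecular Medicine University of Wisconsin",
    "Abstract In this study we investigate",
    "1. Introduction",
]
for i, block in enumerate(blocks):
    print(classify_block(block_features(block, i)), "|", block)
```

A trained classifier would normally replace the rules; the point is only that each block is first turned into features and then labelled.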
8
Technologies are underway to solve the problems..
Meta data extraction

Automatic extraction of keywords
Structural analysis
Normalization of existing meta data
Use external reference data (citation analysis)

Citation graph (figure): Article 1 to Article 8
Infer the relative importance of Article 5, and
use the textual context of the citations to obtain good
descriptors of it
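
As a rough illustration of both ideas on this slide, the sketch below counts incoming citations per article and collects the text surrounding each citation as descriptors of the cited article. Counting incoming citations is only one possible importance measure and is an assumption here; the edges and context strings are invented, since the figure's actual edges are not in the transcript.

```python
from collections import Counter, defaultdict


def summarize_citations(citations):
    """citations: iterable of (citing_id, cited_id, context_text) tuples."""
    counts = Counter()
    contexts = defaultdict(list)
    for citing, cited, context in citations:
        counts[cited] += 1               # one more incoming citation
        contexts[cited].append(context)  # citing text describes the cited article
    return counts, dict(contexts)


# Hypothetical citation edges with hypothetical citation contexts.
example_citations = [
    (1, 5, "a study of E. coli growth"),
    (2, 5, "growth rates reported earlier"),
    (3, 5, "the E. coli protocol"),
    (4, 7, "an alternative method"),
]
counts, contexts = summarize_citations(example_citations)
print(counts.most_common(1))
print(contexts[5])
```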
9
Technologies are underway to solve the problems..
Meta data extraction

Automatic extraction of keywords
Structural analysis
Normalization of existing meta data
Use external reference data (citation analysis)

Citation graph (figure): Article 1 to Article 8
Infer relatedness of Article 8 and Article 7
because they are cited by the same articles
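
Relatedness through shared citing articles is commonly called co-citation. A minimal sketch, assuming the citation graph is available as a mapping from each citing article to the articles it cites (the mapping below is invented for illustration and `cocitation_counts` is a hypothetical name):

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citations):
    """Count, for each pair of articles, how many articles cite both."""
    pairs = Counter()
    for cited in citations.values():
        # Sort so that each pair is always stored in the same order.
        for a, b in combinations(sorted(set(cited)), 2):
            pairs[(a, b)] += 1
    return pairs


# Hypothetical graph: citing article -> cited articles.
example_graph = {
    1: [7, 8],
    2: [7, 8],
    3: [5, 7],
}
pairs = cocitation_counts(example_graph)
print(pairs.most_common(1))
```

The more articles cite two documents together, the stronger the evidence that the two are related.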
10
Technologies are underway to solve the problems..
Equal ranking

Test runs with representative queries
Check typical ranking position per content
source
Assign static rank boosts per content source,
based on results

Content source A

only abstracts
rich meta data
no external references

Content source B

full text documents
indexed in citation index
rich meta data

Content source C

full text documents
PDF, DOC → conversion problems

Content source D

web data
hard to crawl, distributed documents
unreliable meta data
web anchor text as external reference

Content source E

little meta data
indexed in citation index
full text articles

Each content source feeds the retrieval engine with a static boost (high, medium or low in the figure).
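
A static rank boost per content source can be sketched as a factor applied to each hit's relevance score before the merged result list is sorted. The multiplicative form, the boost values and the source names below are assumptions for illustration; the slide only says that boosts are assigned per source based on test runs.

```python
def merge_results(hits, boosts, default_boost=1.0):
    """hits: list of (doc_id, source, score); returns hits re-ranked by boosted score."""
    rescored = [
        (doc_id, source, score * boosts.get(source, default_boost))
        for doc_id, source, score in hits
    ]
    return sorted(rescored, key=lambda hit: hit[2], reverse=True)


# Hypothetical boost table, e.g. tuned after test runs with representative queries.
source_boosts = {"source_x": 1.5, "source_y": 1.0, "source_z": 0.5}

hits = [
    ("doc1", "source_x", 0.5),
    ("doc2", "source_y", 0.6),
    ("doc3", "source_z", 0.9),
]
for doc_id, source, score in merge_results(hits, source_boosts):
    print(doc_id, source, score)
```

In this toy run the hit from the boosted source moves ahead of hits whose raw scores were higher, which is the intended levelling effect.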

11
Technologies are underway to solve the problems..
Proper treatment of queries

Deal with orthographic variation
Deal with morphological variation
Deal with vocabulary variation
Deal with special-interest queries (e.g.
restrict on user homepages, find definitions,
narrow down on articles)

Figure: processing of the query "Cerebral infarct"
Techniques shown: phrasing, spellchecking, lemmatization, character normalization, synonymy, thesaurus support, topic classification, doc type classification, refinement, ambiguous queries
Query variants shown:
Cerebral infarkt, Serebral infarct, Cetebral ingarct
Cerebral infarcts
Infarctus cérébral
Apoplexy, Apoplectic insult, Stroke
Cerebral disease, Infarction
Cerebral infarct / medicine, Cerebral infarct / biology
Cerebral infarct / conferences
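
Two of the techniques named on this slide, character normalization and spellchecking, can be sketched with the Python standard library. The approach (stripping diacritics, then fuzzy-matching each token against a known vocabulary with `difflib`) and the tiny vocabulary are assumptions for illustration, not the method used in the presented system.

```python
import difflib
import unicodedata


def normalize_chars(text):
    """Lowercase the text and strip diacritics (e.g. 'é' becomes 'e')."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def correct_query(query, vocabulary, cutoff=0.7):
    """Replace each token with its closest vocabulary entry, if one is close enough."""
    corrected = []
    for token in normalize_chars(query).split():
        matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else token)
    return " ".join(corrected)


# Hypothetical index vocabulary.
vocabulary = ["cerebral", "infarct"]
for query in ["Cerebral infarkt", "Serebral infarct", "Cetebral ingarct"]:
    print(query, "->", correct_query(query, vocabulary))
print(normalize_chars("Infarctus cérébral"))
```

Lemmatization, synonym expansion and thesaurus lookup would be further stages in the same query pipeline and are not covered by this sketch.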
12
Technologies are underway to solve the problems..
Smart data aggregation
While crawling for documents, recognize links that point to other parts of the
document, follow these links and put together a complete
document.
Figure: article page (Journal of Cancer Research, Issue 5, 2003 -12, "Investigations in E. coli", B. C. Abracadabra author info, S. Miheev author info) with links to Abstract, Introduction, Chapter 1, Chapter 2, Chapter 3 and Chapter 4; further links on the page: Journal contents, Current Issue, This issue, Personal Profile.
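
The aggregation idea can be sketched over an in-memory site: follow only links whose label looks like a part of the document and concatenate the text of the pages reached. Recognizing part links by their label text is an assumption made here for illustration; the slide does not say how such links are recognized. The URLs, page contents and names (`PART_LABELS`, `assemble_document`) are hypothetical.

```python
from collections import deque

# Assumed first words of link labels that indicate a part of the same document.
PART_LABELS = {"abstract", "introduction", "chapter"}


def is_part_link(label):
    words = label.lower().split()
    return bool(words) and words[0] in PART_LABELS


def assemble_document(site, start_url):
    """site: url -> {"text": str, "links": [(label, target_url), ...]}."""
    seen = set()
    parts = []
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in seen or url not in site:
            continue
        seen.add(url)
        page = site[url]
        parts.append(page["text"])
        for label, target in page["links"]:
            if is_part_link(label):  # skip navigation such as "Journal contents"
                queue.append(target)
    return "\n".join(parts)


example_site = {
    "/article": {
        "text": "Investigations in E. coli",
        "links": [
            ("Abstract", "/article/abstract"),
            ("Chapter 1", "/article/ch1"),
            ("Journal contents", "/contents"),
        ],
    },
    "/article/abstract": {"text": "In this study we investigate", "links": []},
    "/article/ch1": {"text": "Chapter 1 text", "links": [("Abstract", "/article/abstract")]},
    "/contents": {"text": "Table of contents", "links": []},
}
print(assemble_document(example_site, "/article"))
```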
13
Technologies are underway to solve the problems..
Architecture overview (figure)
Components: Doc, Crawling, Document processing, Citation index, INDEX, Query, Query processing, Result processing, Results
Annotations:
Proper treatment of queries (e.g. covering
morphological and semantic variation)
Query refinement suggestions (e.g. covering
morphological and semantic variation)
Smart data aggregation (e.g. restoring
distributed documents)
Advanced linguistic processing (e.g. terminology
extraction, classification, structural analysis)
14
Scirus
15
Scirus
16
Scirus
17
Evolution of Digital Libraries