Search Engine Technology for Digital Libraries

1
  • Search Engine Technology for Digital Libraries
  • State of the Art and Future
  • 7th International Bielefeld Conference
  • Jürgen Oesterle
  • Juergen.oesterle_at_fastsearch.com
  • Fast Search & Transfer Deutschland GmbH

2
Most prominent problems with digital libraries
Multiple content sources problem
  • In a typical digital library, you have to
    provide a combined search on many different
    collections at a time
  • The format of the content varies between these
    collections
  • The availability of structure varies between
    these collections
  • The availability of external reference data
    varies between these collections
  • The availability of meta data varies between
    these collections
  • The kind of content might vary between these
    collections

On these grounds, it's extremely difficult to
provide equal ranking among the documents in a
result set that come from different content
sources.
3
Most prominent problems with digital libraries
Meta data problem
  • You can't tell how the meta data was generated
  • (Author? Editor? Automatically assigned?)
  • You can't tell in advance what meta data is
    available
  • (Title, author, keywords, date, publisher,
    place, etc.)
  • You don't know the original purpose of the meta
    data
  • (Quick summary for the reader? Condensed
    description? Normalization of content for
    search?)
  • You can't assume uniform availability and
    quality of meta data even within one collection

4
Most prominent problems with digital libraries
Distributed documents problem
  • Documents are often really hypertext, i.e. their
    parts are distributed over a site, with links
    between them

Multiple languages problem
  • Documents are often in many different languages

Availability of classification schemas problem
  • If classification is of interest (and of help
    while searching), the underlying classification
    taxonomies are not standardized across collections

5
Most prominent problems with digital libraries
Inaccurate queries problem
  • Users typically lack domain-specific knowledge
  • Users don't have the proper terminology to hand
  • Users don't include all potential synonyms and
    variations in the query
  • Users have a problem but aren't sure how to
    phrase it (i.e. how the same problem is phrased
    in the documents)

On these grounds, it's extremely difficult to
provide a perfectly relevant result set as a first
response. Intelligent suggestions for refinement
or expansion are needed.
6
Technologies are underway to solve the problems ...
Meta data extraction
  • Automatic extraction of keywords
  • Structural analysis
  • Normalization of existing meta data
  • Use external reference data (citation analysis)

Example: automatic extraction of keywords
  • Input sentence: "Suffering from chronic rhinitis,
    the patient was treated ..."
  • Part-of-speech tagging and normalization:
    Vpart Prep Adj N Det N Vcop Vpart
  • Extraction of specific syntactic patterns
  • Statistical analysis of the extracted patterns:
    log( P(chronic rhinitis) / ( P(chronic) · P(rhinitis) ) )
  • Identification of new terminology:
    chronic rhinitis
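
The statistical step above compares how often two words occur together with how often they would co-occur by chance, which is the idea behind pointwise mutual information. The following is a minimal sketch under stated assumptions, not the presenter's implementation: the tagged corpus is invented, the candidate pattern is assumed to be an adjective directly followed by a noun, and the names `adj_noun_pairs` and `score_candidates` are hypothetical.

```python
import math
from collections import Counter

# Invented toy corpus of (word, tag) pairs; tag names follow the slide.
TAGGED_SENTENCES = [
    [("suffering", "Vpart"), ("from", "Prep"), ("chronic", "Adj"),
     ("rhinitis", "N"), ("the", "Det"), ("patient", "N"),
     ("was", "Vcop"), ("treated", "Vpart")],
    [("chronic", "Adj"), ("rhinitis", "N"), ("is", "Vcop"), ("common", "Adj")],
    [("the", "Det"), ("chronic", "Adj"), ("pain", "N"),
     ("was", "Vcop"), ("treated", "Vpart")],
    [("the", "Det"), ("patient", "N"), ("was", "Vcop"), ("treated", "Vpart")],
]


def adj_noun_pairs(sentence):
    """Return (adjective, noun) pairs of directly adjacent tokens."""
    return [(w1, w2) for (w1, t1), (w2, t2) in zip(sentence, sentence[1:])
            if t1 == "Adj" and t2 == "N"]


def score_candidates(sentences):
    """Score each Adj-N candidate with log( P(w1 w2) / (P(w1) * P(w2)) )."""
    word_counts = Counter()
    bigram_counts = Counter()
    candidates = set()
    for sentence in sentences:
        words = [w for w, _ in sentence]
        word_counts.update(words)
        bigram_counts.update(zip(words, words[1:]))
        candidates.update(adj_noun_pairs(sentence))
    n_words = sum(word_counts.values())
    n_bigrams = sum(bigram_counts.values())
    scores = {}
    for w1, w2 in candidates:
        p_pair = bigram_counts[(w1, w2)] / n_bigrams
        p_w1 = word_counts[w1] / n_words
        p_w2 = word_counts[w2] / n_words
        scores[(w1, w2)] = math.log(p_pair / (p_w1 * p_w2))
    return scores


if __name__ == "__main__":
    for pair, score in sorted(score_candidates(TAGGED_SENTENCES).items()):
        print(" ".join(pair), round(score, 3))
```

A positive score means the pair occurs together more often than independence would predict, which makes it a candidate for new terminology. On a corpus this small the scores are not meaningful; a realistic run would also need a frequency threshold.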
7
Technologies are underway to solve the problems ...
Meta data extraction
  • Automatic extraction of keywords
  • Structural analysis
  • Normalization of existing meta data
  • Use external reference data (citation analysis)

Example: structural analysis, processing steps
  • 1. Analyse structure
  • 2. Determine text block features
  • 3. Classify text blocks
  • 4. Apply structure grammar

Example page, with the labels assigned to its text blocks
  • Journal of Cancer Research (label: Journal title)
  • Issue 5, 2003 -12
  • Investigations in E. coli (label: Article title)
  • B. C. Abracadabra, Department of Molecular
    Medicine, University of Wisconsin; S. Miheev,
    Analytical Laboratory, Russian Academy of
    Sciences, Moscow (label: Affiliation)
  • Abstract
  • In this study we investigate
  • 1. Introduction
  • 2. Materials and Methods
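
Steps 2 to 4 can be illustrated with a small rule-based sketch. Everything in it is an assumption made for illustration: the layout features (`font_size`, `bold`), the hand-written classification rules, the one-letter block classes, and the regular expression standing in for the structure grammar. A production system would more likely learn the classifier from labelled pages.

```python
import re
from dataclasses import dataclass


@dataclass
class TextBlock:
    text: str
    font_size: float  # hypothetical layout feature delivered by step 1
    bold: bool        # hypothetical layout feature delivered by step 1


def block_features(block):
    """Step 2: derive simple features from one text block."""
    text = block.text.strip()
    return {
        "large": block.font_size >= 14,
        "bold": block.bold,
        "numbered": bool(re.match(r"\d+\.\s", text)),
        "institution": bool(re.search(
            r"\b(Department|University|Laboratory|Academy)\b", text)),
        "journal": text.startswith("Journal"),
        "issue": text.startswith("Issue"),
        "abstract_heading": text.lower() == "abstract",
    }


def classify(block):
    """Step 3: map features to a one-letter block class."""
    f = block_features(block)
    if f["abstract_heading"]:
        return "H"  # abstract heading
    if f["numbered"] and f["bold"]:
        return "S"  # section heading
    if f["issue"]:
        return "I"  # issue line
    if f["journal"]:
        return "J"  # journal title
    if f["institution"]:
        return "A"  # author and affiliation
    if f["large"]:
        return "T"  # article title
    return "P"      # body paragraph


# Step 4: a tiny "structure grammar" over the sequence of block classes.
STRUCTURE_GRAMMAR = re.compile(r"JI?TA+HP+(SP*)+")


def conforms(blocks):
    """Return the label sequence and whether it matches the grammar."""
    labels = "".join(classify(b) for b in blocks)
    return labels, STRUCTURE_GRAMMAR.fullmatch(labels) is not None


PAGE = [
    TextBlock("Journal of Cancer Research", 12, True),
    TextBlock("Issue 5, 2003", 10, False),
    TextBlock("Investigations in E. coli", 16, True),
    TextBlock("B. C. Abracadabra, Department of Molecular Medicine, "
              "University of Wisconsin", 10, False),
    TextBlock("S. Miheev, Analytical Laboratory, Russian Academy of "
              "Sciences, Moscow", 10, False),
    TextBlock("Abstract", 10, True),
    TextBlock("In this study we investigate", 10, False),
    TextBlock("1. Introduction", 11, True),
    TextBlock("2. Materials and Methods", 11, True),
]

if __name__ == "__main__":
    print(conforms(PAGE))
```

The grammar check is what lets the system reject a misclassified page: if the title block is missing or blocks arrive in an unexpected order, the label sequence no longer matches.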
8
Technologies are underway to solve the problems ...
Meta data extraction
  • Automatic extraction of keywords
  • Structural analysis
  • Normalization of existing meta data
  • Use external reference data (citation analysis)

Citation graph (Articles 1 to 8)
  • Infer the relative importance of Article 5
  • and use the textual context of each citation to
    obtain good descriptors of it
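
Both ideas on this slide can be sketched in a few lines. The citation edges and context sentences below are invented, since the transcript does not preserve the arrows of the graph; importance is approximated by a plain count of incoming citations, and `citation_counts` and `context_descriptors` are hypothetical names.

```python
import re
from collections import Counter

# Invented citations: (citing article, cited article, text around the citation).
CITATIONS = [
    (1, 5, "as shown for stroke rehabilitation in [5]"),
    (2, 5, "the stroke rehabilitation protocol of [5]"),
    (3, 5, "[5] reports outcomes after cerebral infarct"),
    (4, 7, "see [7] for imaging methods"),
    (6, 8, "imaging methods are compared in [8]"),
]

STOPWORDS = {"as", "for", "in", "the", "of", "after", "see", "are"}


def citation_counts(citations):
    """Count incoming citations per cited article (a crude importance signal)."""
    return Counter(cited for _, cited, _ in citations)


def context_descriptors(citations, target, top_n=2):
    """Most frequent non-stopword terms in the contexts that cite `target`."""
    terms = Counter()
    for _, cited, context in citations:
        if cited == target:
            words = re.findall(r"[a-z]+", context.lower())
            terms.update(w for w in words if w not in STOPWORDS)
    return [term for term, _ in terms.most_common(top_n)]


if __name__ == "__main__":
    print(citation_counts(CITATIONS).most_common(1))
    print(context_descriptors(CITATIONS, 5))
```

The descriptors come from what other authors wrote about the article, so they can supply search terms even when the cited document itself has poor meta data.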
9
Technologies are underway to solve the problems ...
Meta data extraction
  • Automatic extraction of keywords
  • Structural analysis
  • Normalization of existing meta data
  • Use external reference data (citation analysis)

Citation graph (Articles 1 to 8)
  • Infer the relatedness of Article 8 and Article 7
    because they are cited by the same articles
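
Counting how often two articles are cited by the same article is usually called co-citation analysis. A minimal sketch, assuming invented citation lists (the actual edges of the slide's graph are not recoverable from the transcript) and a hypothetical function name `cocitation_counts`:

```python
from collections import Counter
from itertools import combinations

# Invented citation lists: citing article -> articles it cites.
CITES = {
    1: [7, 8],
    2: [7, 8, 5],
    3: [5],
    4: [5, 7],
    6: [7, 8],
}


def cocitation_counts(cites):
    """For each pair of articles, count the articles that cite both."""
    counts = Counter()
    for cited in cites.values():
        for a, b in combinations(sorted(set(cited)), 2):
            counts[(a, b)] += 1
    return counts


if __name__ == "__main__":
    for pair, n in cocitation_counts(CITES).most_common():
        print(pair, n)
```

With these sample lists, Articles 7 and 8 are cited together by three articles, more than any other pair, so they would be treated as the most closely related.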
10
Technologies are underway to solve the problems ...
Equal ranking
  • Test runs with representative queries
  • Check typical ranking position per content
    source
  • Assign static rank boosts per content source,
    based on results

Each content source feeds the Retrieval Engine with
a static boost: two sources with a high boost, two
with a medium boost, and one with a low boost.

Content source A
  • only abstracts
  • rich meta data
  • no external references

Content source B
  • full text documents
  • indexed in citation index
  • rich meta data

Content source C
  • full text documents
  • PDF, DOC: conversion problems

Content source D
  • web data
  • hard to crawl, distributed documents
  • unreliable meta data
  • web anchor text as external reference

Content source E
  • little meta data
  • indexed in citation index
  • full text articles
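
The three-step procedure (test runs, typical position per source, static boosts) could look roughly like the sketch below. The slide gives no formula, so the choices here are assumptions: the typical position is taken as the median rank, and a source that typically ranks further down receives a proportionally larger boost. The test runs and the names `typical_positions`, `static_boosts`, and `boosted_score` are invented for illustration.

```python
from collections import defaultdict
from statistics import median

# Invented test runs: for each representative query, the content source of
# every result in ranked order (position 1 first).
TEST_RUNS = [
    ["B", "B", "A", "C", "A", "C"],
    ["B", "A", "B", "A", "C", "C"],
    ["A", "B", "B", "C", "A", "C"],
]


def typical_positions(test_runs):
    """Median ranking position of the results of each content source."""
    positions = defaultdict(list)
    for ranking in test_runs:
        for position, source in enumerate(ranking, start=1):
            positions[source].append(position)
    return {source: median(p) for source, p in positions.items()}


def static_boosts(typical):
    """Boost each source relative to the best-ranking source (boost 1.0)."""
    best = min(typical.values())
    return {source: position / best for source, position in typical.items()}


def boosted_score(score, source, boosts):
    """Apply the static boost of a source to a raw relevance score."""
    return score * boosts.get(source, 1.0)


if __name__ == "__main__":
    boosts = static_boosts(typical_positions(TEST_RUNS))
    print(boosts)
```

Because the boosts are static, they are computed once from the test runs and then applied at query time without further cost; they need to be recalibrated when a collection changes substantially.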

11
Technologies are underway to solve the problems ...
Proper treatment of queries
  • Deal with orthographic variation
  • Deal with morphological variation
  • Deal with vocabulary variation
  • Deal with special-interest queries (e.g.
    restrict to user homepages, find definitions,
    narrow down to articles)

Example query: Cerebral infarct
  • Spellchecking: Cerebral infarkt, Serebral
    infarct, Cetebral ingarct
  • Lemmatization: Cerebral infarcts
  • Character normalization: Infarctus cérébral
  • Phrasing
  • Synonymy: Apoplexy, Apoplectic insult, Stroke
  • Thesaurus support: Cerebral disease, Infarction
  • Refinement
  • Doc type classification: Cerebral infarct /
    conferences
  • Topic classification: Cerebral infarct /
    medicine, Cerebral infarct / biology
  • Ambiguous queries
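
A few of these query treatments can be sketched with the Python standard library alone. This is a toy illustration under assumptions, not the engine's actual processing: the vocabulary stands in for terms known to the index, the plural stripping is deliberately crude, spell correction uses `difflib` string similarity, and the synonym list is copied from the slide's example.

```python
import difflib
import unicodedata

# Hypothetical vocabulary of terms known to the index.
VOCABULARY = ["cerebral", "infarct", "infarction", "stroke", "apoplexy"]

# Synonyms taken from the slide's example.
SYNONYMS = {"cerebral infarct": ["apoplexy", "apoplectic insult", "stroke"]}


def normalize_chars(text):
    """Character normalization: lowercase and strip diacritics."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def lemmatize(token):
    """Very crude plural stripping; real lemmatizers use dictionaries."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


def spellcheck(token, vocabulary=VOCABULARY, cutoff=0.7):
    """Replace a token by its closest vocabulary entry, if one is close enough."""
    matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def process_query(query):
    """Normalize a query and collect synonym expansions for it."""
    tokens = normalize_chars(query).split()
    tokens = [spellcheck(lemmatize(t)) for t in tokens]
    normalized = " ".join(tokens)
    return {"normalized": normalized,
            "expansions": SYNONYMS.get(normalized, [])}


if __name__ == "__main__":
    for q in ["Cerebral infarkt", "Serebral infarct", "Cerebral infarcts"]:
        print(q, "->", process_query(q))
```

Running the misspelled and inflected variants through the same function maps them all to one normalized query, which is what makes the synonym lookup reliable.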
12
Technologies are underway to solve the problems ...
Smart data aggregation
While crawling for documents
  • recognize links that point to other parts of the
    document
  • follow these links and put together a complete
    document

Example: an article page (Journal of Cancer
Research, Issue 5, 2003 -12, "Investigations in E.
coli", with author info for B. C. Abracadabra and
S. Miheev)
  • links to parts of the document: Abstract,
    Introduction, Chapter 1, Chapter 2, Chapter 3,
    Chapter 4
  • other links on the page: Journal contents,
    Current Issue, This issue, Personal Profile
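
The aggregation step can be sketched on an in-memory stand-in for a crawled site. The URLs, page texts, and the anchor-text pattern used to recognize part links are all assumptions for illustration; a real crawler would need stronger signals than anchor text, and would follow links more than one level deep.

```python
import re

# Hypothetical mini-site: URL -> (page text, [(anchor text, target URL), ...]).
SITE = {
    "/article": ("Investigations in E. coli", [
        ("Abstract", "/article/abstract"),
        ("Introduction", "/article/intro"),
        ("Chapter 1", "/article/ch1"),
        ("Chapter 2", "/article/ch2"),
        ("Current Issue", "/issue/current"),
        ("Personal Profile", "/profile"),
    ]),
    "/article/abstract": ("In this study we investigate", []),
    "/article/intro": ("Introduction text", []),
    "/article/ch1": ("Chapter 1 text", []),
    "/article/ch2": ("Chapter 2 text", []),
    "/issue/current": ("Table of contents", []),
    "/profile": ("Profile page", []),
}

# Assumed heuristic: anchors naming a document part point to that part.
PART_LINK = re.compile(r"(abstract|introduction|chapter \d+)", re.IGNORECASE)


def is_part_link(anchor_text):
    """Recognize links that point to another part of the same document."""
    return PART_LINK.fullmatch(anchor_text.strip()) is not None


def aggregate(site, start_url):
    """Follow part links from the start page and join the parts in link order."""
    text, links = site[start_url]
    parts = [text]
    seen = {start_url}
    for anchor, target in links:
        if is_part_link(anchor) and target in site and target not in seen:
            seen.add(target)
            parts.append(site[target][0])
    return "\n".join(parts)


if __name__ == "__main__":
    print(aggregate(SITE, "/article"))
```

Indexing the aggregated text as one document means a query can match terms that are spread across the abstract and a later chapter, which separate page-level indexing would miss.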
13
Technologies are underway to solve the problems ...
Indexing side: Crawling, Doc, Document processing, INDEX
  • Crawling: smart data aggregation (e.g. restoring
    distributed documents)
  • Document processing: advanced linguistic
    processing (e.g. terminology extraction,
    classification, structural analysis)
  • Citation index

Query side: Query, Query processing, INDEX, Result processing, Results
  • Query processing: proper treatment of queries
    (e.g. covering morphological and semantic
    variation)
  • Result processing: query refinement suggestions
    (e.g. covering morphological and semantic
    variation)
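
How the stages fit together can be shown with a toy engine. This is a sketch under assumptions and does not reflect the architecture of any particular product: document processing and query processing share one crude normalizer, the index is a plain inverted index with AND semantics, and result processing suggests refinement terms taken from the matching documents. The class and method names are hypothetical.

```python
import re
from collections import Counter, defaultdict


def process_text(text):
    """Shared processing for documents and queries: lowercase, crude plural stripping."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]


class ToyEngine:
    def __init__(self):
        self.index = defaultdict(set)  # term -> ids of documents containing it
        self.doc_terms = {}            # document id -> processed terms

    def add_document(self, doc_id, text):
        """Document processing followed by indexing."""
        terms = process_text(text)
        self.doc_terms[doc_id] = terms
        for term in terms:
            self.index[term].add(doc_id)

    def search(self, query):
        """Query processing followed by an AND lookup in the inverted index."""
        terms = process_text(query)
        if not terms:
            return []
        hits = set.intersection(*(self.index.get(t, set()) for t in terms))
        return sorted(hits)

    def suggest_refinements(self, query, top_n=3):
        """Result processing: frequent terms in the hits that the query lacks."""
        query_terms = set(process_text(query))
        counts = Counter()
        for doc_id in self.search(query):
            counts.update(t for t in set(self.doc_terms[doc_id])
                          if t not in query_terms)
        return [term for term, _ in counts.most_common(top_n)]


if __name__ == "__main__":
    engine = ToyEngine()
    engine.add_document(1, "Cerebral infarcts and stroke")
    engine.add_document(2, "Stroke rehabilitation")
    engine.add_document(3, "Cerebral infarct imaging")
    print(engine.search("cerebral infarct"))
    print(engine.suggest_refinements("cerebral infarct"))
```

Applying the same normalizer on both sides is the important design point: a query for "infarct" only finds a document containing "infarcts" if both were reduced to the same index term.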
14
Scirus
15
Scirus
16
Scirus
17
Evolution of Digital Libraries
Traditional DL
  • data base
  • pure predefined meta data
  • exact match
  • data is heterogeneous, not normalized,
    incomplete, unreliable

Full text search engine
  • inverted index
  • full text
  • exact match
  • data is heterogeneous, not normalized,
    redundant, unreliable

Next generation DL
  • inverted index + linguistics
  • smart data aggregation
  • extracted information
  • fuzzy search
  • data is homogeneous, auto-normalized,
    auto-completed, reliable