Title: CS 430 / INFO 430 Information Retrieval
1. CS 430 / INFO 430 Information Retrieval
Lecture 10: Searching Book Collections
2. Course Administration
Midterm Examination: Wednesday, September 26, from 7:30 to 9:00, in Phillips Hall 203. Material from lectures and discussion papers, up to this class. See the examination page on the web site for more information and a sample examination from a previous year. Open book. Laptop computers may be used (a) to store lecture slides and papers read in the discussion classes, (b) as a calculator. No electronic device may be used for any form of communication. No connections to the Internet or the Web.
3. Book Collections: The Problem
Suppose that by 2015, every book in the Cornell University Library has been digitized. How will we search the content of this corpus? Most information retrieval methods have been developed and tested on fairly small documents (fewer than 1,000 words): catalog records, abstracts, news items, web pages. A typical book has about 100,000 words.
4. Book Collections
Accurate Full Text: Text is transcribed or obtained from publishers' files. Structural mark-up and perhaps formatting mark-up. No errors.
Scanned and Digitized: Image created by scanning. Text extracted by optical character recognition (with errors). Minimal metadata except pagination.
5. Structural Mark-up Example
<poem>
 <title>The SICK ROSE</title>
 <stanza>
  <line>O Rose thou art sick.</line>
  <line>The invisible worm,</line>
  <line>That flies in the night</line>
  <line>In the howling storm:</line>
 </stanza>
</poem>
from C. M. Sperberg-McQueen and Lou Burnard, TEI Consortium
6. Structural Mark-up Example
<play>
 <author>Shakespeare</author>
 <title>Macbeth</title>
 <act number="I"><scene number="VII">
  <title>Macbeth's castle</title>
  <verse>Will I with wine and wassail ...</verse>
 </scene></act>
</play>
Note that Macbeth appears in two different contexts.
from Manning, et al., Chapter 10
7. XML Document Object Model
root element play
  element author
    text Shakespeare
  element title
    text Macbeth
(continued on next slide)
from Manning, et al., Chapter 10
8. XML Document Object Model (continued)
  element act
    attribute number="I"
    element scene
      attribute number="VII"
      element title
        text Macbeth's castle
      element verse
        text Will I with...
from Manning, et al., Chapter 10
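The tree above can be reproduced programmatically. The following is a minimal sketch, not taken from the slides, that parses the Macbeth mark-up with Python's standard xml.etree.ElementTree module and prints its element, attribute, and text nodes; the helper name print_tree and the layout of the output are assumptions made for the example.

import xml.etree.ElementTree as ET

MACBETH_XML = """
<play>
 <author>Shakespeare</author>
 <title>Macbeth</title>
 <act number="I">
  <scene number="VII">
   <title>Macbeth's castle</title>
   <verse>Will I with wine and wassail ...</verse>
  </scene>
 </act>
</play>
"""

def print_tree(element, depth=0):
    """Print each element, its attributes, and any text directly beneath it."""
    indent = "  " * depth
    print(f"{indent}element {element.tag}")
    for name, value in element.attrib.items():
        print(f'{indent}  attribute {name}="{value}"')
    if element.text and element.text.strip():
        print(f"{indent}  text {element.text.strip()}")
    for child in element:
        print_tree(child, depth + 1)

print_tree(ET.fromstring(MACBETH_XML))
# prints: element play, element author, text Shakespeare, element title,
# text Macbeth, element act, attribute number="I", ...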
9. Schemas and DTDs
Schemas and Document Type Definitions are alternative ways to define what entities and elements are allowable in a particular class of documents, and the base character set encoding used. In a large corpus of books:
There are many document types, e.g., anthologies, plays, collections of papers, etc.
Even for a single document type, there may be many different schemas or DTDs, e.g., each publisher may have its own DTD.
See CS/Info 431 for XML schemas.
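To make the point about publisher-specific DTDs concrete, here is a small hypothetical sketch, not from the slides, that validates the same poem against two invented DTDs. It assumes the third-party lxml library is installed, and the element names in both DTDs are made up for the example.

from io import StringIO
from lxml import etree

# Two hypothetical publishers describe the same kind of document -- a poem --
# with different DTDs, so a document valid for one is invalid for the other.
PUBLISHER_A_DTD = StringIO("""
<!ELEMENT poem (title, stanza+)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT stanza (line+)>
<!ELEMENT line (#PCDATA)>
""")
PUBLISHER_B_DTD = StringIO("""
<!ELEMENT poem (heading, verse+)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT verse (#PCDATA)>
""")

doc = etree.XML(
    "<poem><title>The SICK ROSE</title>"
    "<stanza><line>O Rose thou art sick.</line></stanza></poem>"
)
print(etree.DTD(PUBLISHER_A_DTD).validate(doc))  # True: conforms to publisher A's DTD
print(etree.DTD(PUBLISHER_B_DTD).validate(doc))  # False: publisher B expects other elements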
10. Challenges: Indexing Unit
What is the indexing unit? With an anthology of poems, the indexing unit might be a poem. With a collection of research papers, the indexing unit might be a paper. But...
With a collection of plays, is the unit a play, an act, a scene?
What is the indexing unit of an academic monograph, or a text book, or a novel?
With scanned images, can we use a page as an indexing unit?
11. Indexing Unit Options
- 1. Structured document retrieval principle. A system should always retrieve the most specific part of a document answering the query.
- 2. Standard element. Designate one XML element as the document unit for indexing. (E.g., designate a particular element type, or, with scanned books, select the page as the indexing unit; see the sketch after this list.)
- 3. Pseudodocuments. Group nodes into non-overlapping pseudodocuments. (See the next two slides for an example.)
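A minimal sketch of option 2, not from the slides, using Python's standard library: every element of a designated type becomes one indexing unit, and all text beneath it is gathered into that unit. The choice of <scene> as the unit and the function name are assumptions.

import xml.etree.ElementTree as ET

def standard_element_units(xml_text, unit_tag="scene"):
    """Return one text blob per element whose tag is `unit_tag`."""
    root = ET.fromstring(xml_text)
    units = []
    for unit in root.iter(unit_tag):
        text = " ".join(" ".join(unit.itertext()).split())  # collapse whitespace
        units.append(text)
    return units

play = ("<play><title>Macbeth</title><act number='I'>"
        "<scene number='VII'><title>Macbeth's castle</title>"
        "<verse>Will I with wine and wassail ...</verse></scene></act></play>")
print(standard_element_units(play))
# ["Macbeth's castle Will I with wine and wassail ..."]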
12. Pseudodocuments
(The DOM tree of slide 7 again: root element play with its author and title nodes. In the original figure the nodes are grouped into non-overlapping pseudodocuments.)
(continued on next slide)
from Manning, et al., Chapter 10
13. Pseudodocuments (continued)
(The rest of the same tree: the act/scene subtree of slide 8, again with the nodes grouped into non-overlapping pseudodocuments.)
from Manning, et al., Chapter 10
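A rough sketch of option 3, not from the slides: leaf text is assigned to exactly one pseudodocument, with a new pseudodocument started at each element of a chosen boundary type. The boundary rule (a <scene> starts a new unit; everything else stays with the root's unit) and the helper names are assumptions made for the example.

import xml.etree.ElementTree as ET

def pseudodocuments(root, boundary_tags=("scene",)):
    """Return (label, text) pairs; every text node belongs to exactly one pair."""
    docs = []

    def collect(element, bucket):
        if element.tag in boundary_tags:     # boundary element: start a new unit
            bucket = []
            docs.append((element.tag, bucket))
        if element.text and element.text.strip():
            bucket.append(element.text.strip())
        for child in element:
            collect(child, bucket)

    top = []
    docs.append((root.tag, top))
    for child in root:
        collect(child, top)
    return [(label, " ".join(words)) for label, words in docs]

play = ET.fromstring(
    "<play><author>Shakespeare</author><title>Macbeth</title>"
    "<act number='I'><scene number='VII'><title>Macbeth's castle</title>"
    "<verse>Will I with wine and wassail ...</verse></scene></act></play>"
)
for label, text in pseudodocuments(play):
    print(label, "->", text)
# play -> Shakespeare Macbeth
# scene -> Macbeth's castle Will I with wine and wassail ...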
14. Challenges: User Interface Design
If all documents in the corpus have the same schema, this can be incorporated into the user interface design. Example: Library catalogs use a predefined set of field names, which are well known to experienced users, e.g., author, title, date, ISBN, LCCN. If the documents use a wide variety of schemas, or the users do not understand the schema, they cannot make use of the mark-up in formulating their queries, or they may not receive the results that they require. Manning, et al. state that users are very bad at remembering details of the schema and at constructing queries that comply with it.
15. Vector Space Model for XML Retrieval
The context of a leaf node is the path from the root element to the leaf, e.g., play/act/scene/title. A structural term is a term in context, e.g., play/act/scene/title/"Macbeth". To distinguish the context in which a term is used, define a vector space in which each distinct structural term is a separate dimension. For example:
dimension 1: play/act/scene/title/"Macbeth"
dimension 2: play/title/"Macbeth"
16. Vector Space Model for XML Retrieval (continued)
The context resemblance function, cr, is a measure of how closely the context of a term in a query matches the context of a term in the document. One suggested measure is
cr(cq, cd) = (1 + |cq|) / (1 + |cd|)
where |cq| and |cd| are the number of nodes in the query path cq and the document path cd. The function returns 0 if the query path cannot be extended to match the document path.
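A small sketch of this function, not from the slides, assuming paths are represented as sequences of node names and that "can be extended to match" means the query path appears, in order, within the document path:

def cr(query_path, doc_path):
    """Context resemblance: (1 + |cq|) / (1 + |cd|) if cq matches cd, else 0."""
    remaining = iter(doc_path)
    if all(node in remaining for node in query_path):  # cq is a subsequence of cd
        return (1 + len(query_path)) / (1 + len(doc_path))
    return 0.0

print(cr(["play", "title"], ["play", "title"]))                   # 1.0
print(cr(["scene", "title"], ["play", "act", "scene", "title"]))  # 0.6
print(cr(["play", "verse"], ["play", "act", "scene", "title"]))   # 0.0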
17. Vector Space Model for XML Retrieval (continued)
This introduction to the vector space model for XML retrieval is based on Manning, et al., Chapter 10. For determining similarity they suggest a context resemblance similarity measure that uses weights for both the context and the term similarity. (See Chapter 10 for a suggested form of this measure.) Note that there are methodological difficulties in using idf for XML retrieval. Manning, et al. suggest calculating a separate idf value for each structural term, but that estimate suffers from lack of data.
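The following is a deliberately simplified sketch in the spirit of such a measure, not the form given in Chapter 10: term weights here are raw counts of structural terms, idf and length normalization are omitted, and matching contexts are weighted by the cr function from the previous slide.

from collections import Counter

def cr(cq, cd):
    """Context resemblance, as on the previous slide; paths are tuples of names."""
    remaining = iter(cd)
    if all(node in remaining for node in cq):
        return (1 + len(cq)) / (1 + len(cd))
    return 0.0

def similarity(query_terms, doc_terms):
    """Each argument is a list of structural terms: (context_path_tuple, word)."""
    q, d = Counter(query_terms), Counter(doc_terms)
    return sum(
        cr(cq, cd) * q_weight * d_weight
        for (cq, q_term), q_weight in q.items()
        for (cd, d_term), d_weight in d.items()
        if q_term == d_term            # terms must match; contexts weighted by cr
    )

query = [(("scene", "title"), "macbeth")]
doc = [(("play", "act", "scene", "title"), "macbeth"),
       (("play", "title"), "macbeth")]
print(similarity(query, doc))  # 0.6: the query context matches play/act/scene/title only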
18. Google Book Search
Initially, the search services in Google Book Search were poor:
The lack of a natural indexing unit leads to confusion.
Web search makes heavy use of the links between web pages. With books, there are no such links between indexing units.
With scanned books, there are significant error rates in optical character recognition and format recognition.
The search services are steadily improving as Google (a) imports metadata from external sources, e.g., catalog records, and (b) adds structural metadata to identify indexing units.