Title: Why Can't We All Get Along? (Structured Data and Information Retrieval)
1. Why Can't We All Get Along? (Structured Data and Information Retrieval)
- Bruce Croft
- Computer Science Department
- University of Massachusetts Amherst
2. Overview
- History of structured data in IR
- Conceptual similarities and differences
- What is the goal?
- The Indri System
- Examples using IR for structured data
- XML retrieval
- Relevance models
- Entity retrieval
3. History
- IR systems have had Boolean field restrictions since the 1970s
  - metadata: date, type, source, keywords
  - content structure: title, body
- Implementing IR systems using a relational DBMS was first done in the 70s (Crawford and McCleod, 1978-1983)
- Efficiency issues with this approach persisted until the 90s (e.g. DeFazio et al, SIGIR 95)
- The INQUERY IR system successfully used an object management system (Brown, SIGIR 95)
4. History
- Modifying the DBMS model to incorporate probabilities in order to integrate DB/IR
  - e.g. probabilistic relational algebra (Fuhr and Rölleke, ACM TOIS 1994)
  - e.g. probabilistic Datalog (Fuhr, SIGIR 95)
- Text retrieval as a SQL function in commercial DBMSs
  - e.g. Oracle, early 90s
5. History
- Ranked retrieval of complex documents
  - e.g. office documents with structure and significant text content (Croft, Krovetz and Turtle, IPM 1990)
  - Bayesian inference net model to combine evidence from different parts of document structure (Croft and Turtle, EDBT 1992)
  - e.g. marked-up documents (Croft, Smith, and Turtle, SIGIR 1992)
- XML retrieval
  - INEX (2002)
6. Similarities and Differences
- Common interest in providing efficient access to information on a very large scale
  - indexing and optimization are key topics
- Until recently, concern about the effectiveness (accuracy) of access was the domain of IR
- The focus on structured vs. unstructured data is historically accurate but less relevant today
- Statistical inference and ranking are central to IR, and becoming more important in DB
7. Similarities and Differences
- IR systems have focused on providing access to information rather than answers
  - e.g. Web search
  - evaluation is typically based on topical relevance and user relevance rather than correctness (except QA)
- IR works with multiple databases but not multiple relations
- IR query languages are more like a calculus than an algebra
- Integrity, security, and concurrency are central for DB, less so in IR
8. What is the Goal?
- One unified information system?
  - i.e. a single conceptual and formal framework to support the entire range of information needs
  - at least a grand challenge
  - or is it the Web?
- An integrated DB/IR system?
  - i.e. extend the database model to fully support statistical inference and ranking
  - a major challenge given established systems and models
9. What is the Goal?
- An IR system with extended capability for structured data
  - i.e. extend the IR model to combine evidence from the structured and unstructured components of complex objects (documents)
  - a backend database system is used to store the objects (cf. one hand clapping)
  - many applications look like this (e.g. desktop search, web shopping)
  - users seem to prefer this approach (simple queries or forms, and ranking)
10. What is the Goal?
- What about important database functionality?
  - Source data can be stored in databases
  - The extended IR system will construct separate indexes
- What about optimization?
  - Search engines worry about optimization!
  - Can incorporate ideas from DB optimization
- What about updates?
  - Search engines worry about updates!
  - The backend database system is still available
- What about joins?
  - Interesting. Treat IR objects as a view?
11. Indri: A Candidate IR System
- Indri is a separate, downloadable component of the Lemur Toolkit
- Influences
  - INQUERY [Callan, et al. 92]
    - Inference network framework
    - Query language
  - Lemur (http://www.lemurproject.org)
    - Language modeling (LM) toolkit
  - Lucene (http://jakarta.apache.org/lucene/docs/index.html)
    - Popular off-the-shelf Java-based IR system
    - Based on heuristic retrieval models
- Designed for new retrieval environments
  - i.e. GALE, CALO, AQUAINT, Web retrieval, and XML retrieval
12. Zoology 101
- The indri is the largest type of lemur
- When first spotted, the natives yelled "Indri! Indri!"
  - Malagasy for "Look! Over there!"
13. Design Goals
- Off the shelf (Windows, *NIX, Mac platforms)
  - Simple to set up and use
  - Fully functional API w/ language wrappers for Java, etc.
- Robust retrieval model
  - Inference net + language modeling [Metzler and Croft 04]
- Powerful query language
  - Designed to be simple to use, yet support complex information needs
  - Provides adaptable, customizable scoring
- Scalable
  - Highly efficient code
  - Distributed retrieval
  - Incremental update
14. Model
- Based on the original inference network retrieval framework [Turtle and Croft 91]
  - Casts retrieval as inference in a simple graphical model
- Extensions made to the original model
  - Incorporation of probabilities based on language modeling rather than tf.idf
  - Multiple language models allowed in the network (one per indexed context)
15. Model
[Figure: the Indri inference network. The observed document node D and observed model hyperparameters (α, β) for each context (title, body, h1) determine the context language models θtitle, θbody, θh1. Each context model generates representation nodes r1 ... rN (terms, phrases, etc.), which feed belief nodes q1, q2 (combine, not, max); all evidence is combined at the information need node I, itself a belief node.]
16. Model
[Figure: the same inference network, repeated.]
17. P( r | θ )
- The probability of observing a term, phrase, or feature given a context language model
- The ri nodes are binary
- Assume r ~ Bernoulli( θ )
  - "Model B" [Metzler, Lavrenko, Croft 04]
18. Model
[Figure: the same inference network, repeated.]
19. P( θ | α, β, D )
- Prior over the context language model determined by α, β
- Assume P( θ | α, β ) ~ Beta( α, β )
  - the conjugate prior of the Bernoulli
  - αr = µ P( r | C ) + 1
  - βr = µ P( ¬r | C ) + 1
  - µ is a free parameter (the resulting smoothed estimate is sketched below)
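With this conjugate prior, a smoothed term estimate follows from standard Beta-Bernoulli conjugacy. The derivation below is my own sketch, not taken from the slides (tf_{r,D} is the count of r in the relevant context of D, |D| the context length); taking the posterior mode recovers the familiar Dirichlet-style estimate:

\[
P(\theta \mid \alpha, \beta, D) = \mathrm{Beta}\!\left(\theta;\ \alpha_r + tf_{r,D},\ \beta_r + |D| - tf_{r,D}\right)
\]
\[
\hat{\theta}_{\mathrm{mode}} = \frac{tf_{r,D} + \alpha_r - 1}{|D| + \alpha_r + \beta_r - 2} = \frac{tf_{r,D} + \mu\, P(r \mid C)}{|D| + \mu}
\]

so the collection model P(r|C) acts as the smoothing distribution and µ controls how much weight it receives.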
20. Model
[Figure: the same inference network, repeated.]
21. P( q | r ) and P( I | r )
- Belief nodes are created dynamically based on the query
- Belief node estimates are derived from standard link matrices
  - Combine evidence from parents in various ways
  - Allows fast inference by making marginalization computationally tractable
- The information need node is simply a belief node that combines all network evidence into a single value
- Documents are ranked according to P( I | α, β, D )
22. Example: AND

P(Q = true | A, B)   A       B
0                    false   false
0                    false   true
0                    true    false
1                    true    true

[Figure: nodes A and B are parents of the belief node Q. A sketch of the marginalization follows.]
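The link matrix above can be turned into a belief value by marginalizing over the parents' configurations. The short Python sketch below is mine, not from the slides; it shows that for the AND matrix the belief collapses to the product of the parent beliefs.

    from itertools import product

    def and_link_matrix(parent_values):
        # P(Q = true | parents): 1 only when every parent is true (the table above)
        return 1.0 if all(parent_values) else 0.0

    def belief(parent_beliefs, link_matrix):
        """Marginalize over all parent configurations:
        bel(Q) = sum over configs of P(Q=true | config) * product of parent probabilities."""
        total = 0.0
        for config in product([False, True], repeat=len(parent_beliefs)):
            p_config = 1.0
            for b, value in zip(parent_beliefs, config):
                p_config *= b if value else (1.0 - b)
            total += link_matrix(config) * p_config
        return total

    # For the AND matrix, the closed form is simply the product of the parent beliefs:
    assert abs(belief([0.7, 0.4], and_link_matrix) - 0.7 * 0.4) < 1e-12

Closed forms like this one are what make inference fast: the belief node never has to enumerate configurations at query time.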
23. Query Language
- Extension of the INQUERY query language
- Structured query language
  - Term weighting
  - Ordered / unordered windows
  - Synonyms
- Additional features
  - Language-modeling motivated constructs
  - Added flexibility to deal with fields via contexts
  - Generalization of passage retrieval (extent retrieval)
24. Document Representation

<html>
<head><title>Department Descriptions</title></head>
<body>
The following list describes ...
<h1>Agriculture</h1> <h1>Chemistry</h1> <h1>Computer Science</h1>
<h1>Electrical Engineering</h1> ... <h1>Zoology</h1>
</body>
</html>

<title> context:  <title>department descriptions</title>
<title> extents:  1. department descriptions

<body> context:   <body>the following list describes <h1>agriculture</h1> ... </body>
<body> extents:   1. the following list describes <h1>agriculture</h1> ...

<h1> context:     <h1>agriculture</h1> <h1>chemistry</h1> ... <h1>zoology</h1>
<h1> extents:     1. agriculture   2. chemistry   ...   36. zoology
25. Terms

Type                       | Example             | Matches
Stemmed term               | dog                 | All occurrences of dog (and its stems)
Surface term               | "dogs"              | Exact occurrences of dogs (without stemming)
Term group (synonym group) | <"dogs" canine>     | All occurrences of dogs (without stemming) or canine (and its stems)
POS-qualified term         | <"dogs" canine>.NNS | Same as previous, except matches must also be tagged with the NNS POS tag
26. Proximity

Type                              | Example                                          | Matches
#odN(e1 ... em) or #N(e1 ... em)  | #od5(dog cat) or #5(dog cat)                     | All occurrences of dog and cat appearing, in order, within a window of 5 words
#uwN(e1 ... em)                   | #uw5(dog cat)                                    | All occurrences of dog and cat appearing, in any order, within a window of 5 words
#phrase(e1 ... em)                | #phrase(#1(willy wonka) #uw3(chocolate factory)) | System-dependent implementation (defaults to #odm)
#syntax:xx(e1 ... em)             | #syntax:np(fresh powder)                         | System-dependent implementation
27. Context Restriction

Example                   | Matches
dog.title                 | All occurrences of dog appearing in the title context
dog.title,paragraph       | All occurrences of dog appearing in both a title and a paragraph context (may not be possible)
<dog.title dog.paragraph> | All occurrences of dog appearing in either a title context or a paragraph context
#5(dog cat).head          | All matching windows contained within a head context
28. Context Evaluation

Example                | Evaluated
dog.(title)            | The term dog evaluated using the title context as the document
dog.(title, paragraph) | The term dog evaluated using the concatenation of the title and paragraph contexts as the document
dog.figure(paragraph)  | The term dog restricted to figure tags within the paragraph context
29. Belief Operators

INQUERY     | INDRI
#sum / #and | #combine
#wsum*      | #weight
#or         | #or
#not        | #not
#max        | #max

* #wsum is still available in INDRI, but should be used with discretion

(The scoring of #combine and #weight is sketched below.)
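Not from the slides: a common summary of how the two main Indri belief operators score their arguments, consistent with the Indri documentation (b(q_i) is the belief of the i-th argument, w_i its weight):

\[
b_{\#\mathrm{combine}}(I) = \prod_{i=1}^{n} b(q_i)^{1/n}, \qquad
b_{\#\mathrm{weight}}(I) = \prod_{i=1}^{n} b(q_i)^{\,w_i / \sum_j w_j}
\]

i.e. #combine is a geometric mean of its arguments' beliefs and #weight is a weighted geometric mean.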
30. Extent Retrieval

Example                              | Evaluated
#combine[section](dog canine)        | Evaluates #combine(dog canine) for each extent associated with the section context
#combine[title, section](dog canine) | Same as previous, except evaluated for each extent associated with either the title context or the section context
#sum(#sum[section](dog))             | Returns a single score that is the sum of the scores returned from #sum(dog) evaluated for each section extent
#max(#sum[section](dog))             | Same as previous, except returns the maximum score
31. Extent Retrieval Example

Query: #combine[section]( dirichlet smoothing )

<document>
  <section><head>Introduction</head>
  Statistical language modeling allows formal methods to be applied to
  information retrieval. ...
  </section>
  <section><head>Multinomial Model</head>
  Here we provide a quick review of multinomial language models. ...
  </section>
  <section><head>Multiple-Bernoulli Model</head>
  We now examine two formal methods for statistically modeling documents and
  queries based on the multiple-Bernoulli distribution. ...
  </section>
</document>

- Treat each section extent as a document
- Score each "document" according to #combine( ... )
- Return a ranked list of extents (a sketch of this procedure follows the result list; per-section scores in the figure: 0.15, 0.50, 0.05)

SCORE   DOCID    BEGIN   END
0.50    IR-352   51      205
0.35    IR-352   405     548
0.15    IR-352   0       50
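A minimal sketch of the procedure on this slide, not Indri's actual implementation: treat each section extent as a small document and score it against the query with a Dirichlet-smoothed query likelihood. The smoothing choice and the helper names here are my own assumptions for illustration.

    import math
    import re
    from collections import Counter

    def dirichlet_score(query_terms, extent_text, collection_counts, collection_len, mu=2500):
        """Query log-likelihood of one extent under a Dirichlet-smoothed language model."""
        tokens = re.findall(r"\w+", extent_text.lower())
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            # collection model; 0.5 is a pseudo-count so unseen terms do not give log(0)
            p_coll = collection_counts.get(term, 0.5) / collection_len
            score += math.log((tf[term] + mu * p_coll) / (len(tokens) + mu))
        return score

    def rank_extents(query_terms, extents, collection_counts, collection_len):
        """extents: list of (doc_id, begin, end, text) tuples, one per section extent.
        Returns (score, doc_id, begin, end) tuples sorted by descending score."""
        scored = [(dirichlet_score(query_terms, text, collection_counts, collection_len),
                   doc_id, begin, end)
                  for doc_id, begin, end, text in extents]
        return sorted(scored, reverse=True)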
32. Indri Examples
- Where was George Washington born?

    #combine[sentence]( #1( george washington )
                        born #any:place )

- Paragraphs from news-feed articles published between 1991 and 2000 that mention a person, a monetary amount, and the company InfoCom

    #filreq( #band( NewsFeed.doctype
                    #date:between(1991 2000) )
             #combine[paragraph]( #any:person
                                  #any:money InfoCom ) )
33Example Indri Web Query
weight( 0.1 weight( 1.0
prior(pagerank) 0.75 prior(inlinks) ) 1.0
weight( 0.9 combine(
wsum( 1 stellwagen.(inlink)
1 stellwagen.(title)
3 stellwagen.(mainbody) 1
stellwagen.(heading) ) wsum( 1
bank.(inlink) 1
bank.(title) 3
bank.(mainbody) 1
bank.(heading) ) ) 0.1 combine(
wsum( 1 uw8( stellwagen bank
).(inlink) 1 uw8(
stellwagen bank ).(title)
3 uw8( stellwagen bank ).(mainbody)
1 uw8( stellwagen bank ).(heading) )
) ) )
34. Examples of Using IR for Structured Data
- XML search
- Relevance models for incomplete data
- Extracted entity retrieval
35. XML Search
- The INEX workshop is similar to TREC but focused on XML documents
- Queries contain varying degrees of structural specification
- Not clear that these queries are realistic
  - an earlier study showed that people are not good at remembering structure
  - document structure can provide valuable evidence for content representation
36. Example INEX Query
37. Hierarchical Language Models
- Estimate a language model for each component of a document tree (Ogilvie 2004, 2005)
- Smooth using a weighted mixture of a background model, a document model, a parent model, and a mixture of the children's models (a sketch of this mixture follows)
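A sketch of that mixture in formula form; the λ and π weights are my own notation, and the exact estimator is given in Ogilvie's papers:

\[
P(w \mid \theta_e) = \lambda_C\, P(w \mid C) + \lambda_D\, P(w \mid D) + \lambda_{par}\, P(w \mid \theta_{parent(e)}) + \lambda_{ch} \sum_{c \in children(e)} \pi_c\, P(w \mid \theta_c)
\]

with \(\lambda_C + \lambda_D + \lambda_{par} + \lambda_{ch} = 1\) and \(\sum_c \pi_c = 1\), estimated for each element e of the document tree.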
38. Hierarchical Language Models
39. Does it work?
Results from Ogilvie, 2003
40. Does it work?
Results from Ogilvie, 2003
41. Indri INEX extensions
- Indri incorporates hierarchical language models
- Allows weights to be set for different language models and component types
- Query language extended to reference parent and child extents
  - use the .\field operator to access a child reference
  - use the ./field operator to access a parent reference
  - use the .//field operator to access an ancestor reference
  - e.g. #combine[section]( bootstrap #combine[./title]( methodology ) )
42. Relevance Models for Incomplete Data
- Relevance models (Lavrenko, 2001) are used for query expansion in IR, based on generative LMs
- Estimate dependencies between words based on a training set or an initial ranking
- Recently extended to semi-structured data, for applications where records are missing data (Lavrenko, Yi, Allan, 2006)
  - e.g. the NSDL collection, with fields title, description, subject, content, audience
  - 24% of the 650,000 records have no subject field, 30% no author, 96% no audience
43. Relevance Models for Incomplete Data
- The basic process is to estimate relevance models for each field based on training data for a query, then rank test records by comparing them to the relevance models
- A relevance model estimates how likely it is that a word occurs in a field of a record, given that the record matches the specified query fields
- Ranking is done using a weighted cross-entropy (sketched below)
  - weights reflect the importance of each field
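A sketch of that ranking criterion in my own notation (R_f is the relevance model estimated for field f, D_f the language model of the record's field f, w_f the field weight; the precise formulation is in Lavrenko, Yi, and Allan, 2006):

\[
\mathrm{score}(D) = \sum_{f} w_f \sum_{w \in V} P(w \mid R_f)\, \log P(w \mid D_f)
\]

Records are ranked by this score, i.e. by the (negative) weighted cross-entropy between the field relevance models and the record's field models.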
44. Relevance Models for Incomplete Data
- In the NSDL experiment, 127 queries of the form
  - subject=philosophy AND audience=high school
- In the test collection, all records had their subject and audience field values removed
- Retrieved records had a precision of 30% in the top 10, compared to 15% for a baseline that ranked text records containing all fields
- Shows the potential of probabilistic models for this type of application
  - can also generate structured queries (Calado et al, CIKM 02)
45. Extracted Entity Retrieval
- Information extraction extracts structure from text
  - e.g. names, addresses, email addresses, CVs, publications, tables
- Creates semi-structured (and noisy) data rather than databases
- Table extraction can be the basis for question answering (Wei, Croft and McCallum, 2006)
- Publication extraction is the basis of CiteSeer-like systems (e.g. REXA, McCallum, 2005)
- Person extraction can be the basis for expert finding
46. Expert Finding
- Evaluated in the TREC Enterprise Track
- People are represented by the text that co-occurs with their names
  - which names? what text?
- People are ranked for a query using the text profile
- The relevance model approach is effective
47. Conclusion
- For many applications involving retrieval of semi-structured data, the "right" approach is an IR system based on a probabilistic retrieval model as the front-end, and a database system as the back-end
  - but the IR system is not implemented using the database system
  - "right" means it gives effective results and supports the users' world view
- IR systems based on language models (e.g. Indri) are good candidates