1
Why Can't We All Get Along? (Structured Data and
Information Retrieval)
  • Bruce Croft
  • Computer Science Department
  • University of Massachusetts Amherst

2
Overview
  • History of structured data in IR
  • Conceptual similarities and differences
  • What is the goal?
  • The Indri System
  • Examples using IR for structured data
  • XML retrieval
  • Relevance models
  • Entity retrieval

3
History
  • IR systems have had Boolean field restrictions
    since the 1970s
  • metadata: date, type, source, keywords
  • content structure: title, body
  • Implementing IR systems using a relational DBMS
    was first done in the 70s
  • Crawford and McCleod, 1978-1983
  • Efficiency issues with this approach persisted
    until the 90s (e.g. DeFazio et al, SIGIR 95)
  • The INQUERY IR system successfully used an object
    management system (Brown, SIGIR 95)

4
History
  • Modifying the DBMS model to incorporate
    probabilities in order to integrate DB and IR
  • e.g. probabilistic relational algebra (Fuhr and
    Rölleke, ACM TOIS 1994)
  • e.g. probabilistic datalog (Fuhr, SIGIR 95)
  • Text retrieval as a SQL function in commercial
    DBMSs
  • e.g. Oracle, early 90s

5
History
  • Ranked retrieval of complex documents
  • e.g. office documents with structure and
    significant text content (Croft, Krovetz and
    Turtle, IPM 1990)
  • Bayesian inference net model to combine evidence
    from different parts of the document structure
    (Croft and Turtle, EDBT 1992)
  • e.g. marked-up documents (Croft, Smith, and
    Turtle, SIGIR 1992)
  • XML retrieval
  • INEX (2002)

6
Similarities and Differences
  • Common interest in providing efficient access to
    information on a very large scale
  • indexing and optimization are key topics
  • Until recently, concern about the effectiveness
    (accuracy) of access was the domain of IR
  • Focus on structured vs. unstructured data is
    historically true but less relevant today
  • Statistical inference and ranking are central to
    IR, becoming more important in DB

7
Similarities and Differences
  • IR systems have focused on providing access to
    information rather than answers
  • e.g. Web search
  • evaluation typically based on topical relevance
    and user relevance rather than correctness
    (except QA)
  • IR works with multiple databases but not multiple
    relations
  • IR query languages more like calculus than
    algebra
  • Integrity, security, concurrency are central for
    DB, less so in IR

8
What is the Goal?
  • One unified information system?
  • i.e. a single conceptual and formal framework to
    support the entire range of information needs
  • at least a grand challenge
  • or is it the Web?
  • An integrated DB/IR system?
  • i.e. extend database model to fully support
    statistical inference and ranking
  • a major challenge given established systems and
    models

9
What is the Goal?
  • An IR system with extended capability for
    structured data
  • i.e. extend IR model to include combination of
    evidence from structured and unstructured
    components of complex objects (documents)
  • backend database system used to store objects
    (cf. one hand clapping)
  • many applications look like this (e.g. desktop
    search, web shopping)
  • users seem to prefer this approach (simple
    queries or forms and ranking)

10
What is the Goal?
  • What about important database functionality?
  • Source data can be stored in databases
  • Extended IR system will construct separate
    indexes
  • What about optimization?
  • Search engines worry about optimization!
  • Can incorporate ideas from DB optimization
  • What about updates?
  • Search engines worry about updates!
  • Backend database system still available
  • What about joins?
  • Interesting. Treat IR objects as a view?

11
Indri: A Candidate IR System
  • Indri is a separate, downloadable component of
    the Lemur Toolkit
  • Influences
  • INQUERY [Callan et al. '92]
  • Inference network framework
  • Query language
  • Lemur: http://www.lemurproject.org
  • Language modeling (LM) toolkit
  • Lucene: http://jakarta.apache.org/lucene/docs/index.html
  • Popular off-the-shelf Java-based IR system
  • Based on heuristic retrieval models
  • Designed for new retrieval environments
  • i.e. GALE, CALO, AQUAINT, Web retrieval, and XML
    retrieval

12
Zoology 101
  • The indri is the largest type of lemur
  • When first spotted, the natives yelled "Indri!
    Indri!"
  • Malagasy for "Look! Over there!"

13
Design Goals
  • Off the shelf (Windows, *NIX, Mac platforms)
  • Simple to set up and use
  • Fully functional API w/ language wrappers for
    Java, etc
  • Robust retrieval model
  • Inference net + language modeling [Metzler and
    Croft '04]
  • Powerful query language
  • Designed to be simple to use, yet support complex
    information needs
  • Provides adaptable, customizable scoring
  • Scalable
  • Highly efficient code
  • Distributed retrieval
  • Incremental update

14
Model
  • Based on the original inference network retrieval
    framework [Turtle and Croft '91]
  • Casts retrieval as inference in simple graphical
    model
  • Extensions made to original model
  • Incorporation of probabilities based on language
    modeling rather than tf.idf
  • Multiple language models allowed in the network
    (one per indexed context)

15
Model
[Figure: inference network diagram. Observed model hyperparameters
(α, β) for each context (title, body, h1) and the observed document
node D determine context language models θ_title, θ_body, θ_h1.
These generate the representation nodes r_1 ... r_N (terms, phrases,
etc.), which feed the belief nodes q_1, q_2 (combine, not, max),
which in turn feed the information need node I (itself a belief node).]
16
Model
[Figure: the same inference network diagram, repeated.]
17
P( r | θ )
  • Probability of observing a term, phrase, or
    feature given a context language model
  • r_i nodes are binary
  • Assume r ~ Bernoulli( θ )
  • Model B [Metzler, Lavrenko, Croft '04]

18
Model
[Figure: the same inference network diagram, repeated.]
19
P( θ | α, β, D )
  • Prior over context language model determined by
    a, ß
  • Assume P( θ | α, β ) ~ Beta( α, β )
  • the conjugate prior of the Bernoulli
  • α_r = µ P( r | C ) + 1
  • β_r = µ P( ¬r | C ) + 1
  • µ is a free parameter
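
Taken together with the Bernoulli observation model on slide 17, these
priors make the belief at each representation node reduce to
Dirichlet-style smoothing of the in-document estimate with the collection
model: the mode of the Beta posterior is (tf + µ·P(r|C)) / (|D| + µ).
A minimal Python sketch of that calculation (function names, variable
names, and the default µ are mine for illustration, not Indri's):

    def smoothed_term_belief(tf, doc_len, p_collection, mu=2500):
        """MAP estimate (posterior mode) for one representation node,
        with alpha_r = mu * P(r|C) + 1 and beta_r = mu * P(not r|C) + 1.
        Works out to Dirichlet-style smoothing:
            (tf + mu * P(r|C)) / (doc_len + mu)
        Illustrative sketch only."""
        return (tf + mu * p_collection) / (doc_len + mu)

    # Example: a term occurring 3 times in a 250-word context,
    # with collection probability 0.0001.
    print(smoothed_term_belief(3, 250, 0.0001))   # about 0.00118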

20
Model
[Figure: the same inference network diagram, repeated.]
21
P( q | r ) and P( I | r )
  • Belief nodes are created dynamically based on
    query
  • Belief node estimates are derived from standard
    link matrices
  • Combine evidence from parents in various ways
  • Allows fast inference by making marginalization
    computationally tractable
  • Information need node is simply a belief node
    that combines all network evidence into a single
    value
  • Documents are ranked according to P( I | α, β, D )

22
Example: AND

  A       B       P( Q = true | A, B )
  false   false   0
  false   true    0
  true    false   0
  true    true    1

[Figure: network fragment with parent nodes A and B feeding belief node Q.]
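
Because the link matrices are this regular, belief combination collapses
to closed-form expressions over the parent beliefs instead of an explicit
sum over all parent configurations. A Python sketch of the classic closed
forms (after Turtle and Croft '91); Indri's #combine and #weight
additionally normalize and work with log-probabilities, so treat this as
an illustration, not Indri's actual code:

    from math import prod

    # Closed-form belief combination for the standard link matrices.
    # p is a list of parent beliefs, w a parallel list of weights.
    def bel_and(p):     return prod(p)                       # all parents true
    def bel_or(p):      return 1 - prod(1 - x for x in p)    # at least one true
    def bel_not(p1):    return 1 - p1
    def bel_max(p):     return max(p)
    def bel_sum(p):     return sum(p) / len(p)
    def bel_wsum(w, p): return sum(wi * pi for wi, pi in zip(w, p)) / sum(w)

    # The AND table on this slide: Q is true only when both parents are.
    for a in (0.0, 1.0):
        for b in (0.0, 1.0):
            print(a, b, bel_and([a, b]))
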
23
Query Language
  • Extension of INQUERY query language
  • Structured query language
  • Term weighting
  • Ordered / unordered windows
  • Synonyms
  • Additional features
  • Language modeling motivated constructs
  • Added flexibility to deal with fields via
    contexts
  • Generalization of passage retrieval (extent
    retrieval)

24
Document Representation
<html>
  <head>
    <title>Department Descriptions</title>
  </head>
  <body>
    The following list describes
    <h1>Agriculture</h1>
    <h1>Chemistry</h1>
    <h1>Computer Science</h1>
    <h1>Electrical Engineering</h1>
    ...
    <h1>Zoology</h1>
  </body>
</html>

<title> context:
  <title>department descriptions</title>
<title> extents:
  1. department descriptions

<body> context:
  <body>the following list describes <h1>agriculture</h1> ... </body>
<body> extents:
  1. the following list describes <h1>agriculture</h1> ...

<h1> context:
  <h1>agriculture</h1> <h1>chemistry</h1> ... <h1>zoology</h1>
<h1> extents:
  1. agriculture   2. chemistry   ...   36. zoology
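
A toy sketch of how per-context extents like these can be pulled out of a
tagged document as (begin, end) token offsets. The regular-expression
parsing, the function name, and the fields list are mine for illustration
and assume well-formed markup; Indri's own parser is far more complete:

    import re

    TAG = re.compile(r"<(/?)(\w+)[^>]*>")

    def field_extents(text, fields=("title", "body", "h1")):
        """Tokenize text and record (begin, end) token offsets for each
        occurrence of the given fields (contexts)."""
        tokens, extents, open_at = [], {f: [] for f in fields}, {}
        pos = 0
        for m in TAG.finditer(text):
            tokens.extend(text[pos:m.start()].lower().split())
            pos = m.end()
            closing, name = m.group(1) == "/", m.group(2).lower()
            if name in fields:
                if not closing:
                    open_at[name] = len(tokens)
                else:
                    extents[name].append((open_at.pop(name), len(tokens)))
        tokens.extend(text[pos:].lower().split())
        return tokens, extents

    doc = ("<html><head><title>Department Descriptions</title></head>"
           "<body>The following list describes <h1>Agriculture</h1> "
           "<h1>Chemistry</h1> <h1>Zoology</h1></body></html>")
    tokens, extents = field_extents(doc)
    print(extents)
    # {'title': [(0, 2)], 'body': [(2, 9)], 'h1': [(6, 7), (7, 8), (8, 9)]}
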
25
Terms
Type | Example | Matches
Stemmed term | dog | All occurrences of dog (and its stems)
Surface term | "dogs" | Exact occurrences of dogs (without stemming)
Term group (synonym group) | <dogs canine> | All occurrences of dogs (without stemming) or canine (and its stems)
POS qualified term | <dogs canine>.NNS | Same as previous, except matches must also be tagged with the NNS POS tag
26
Proximity
Type | Example | Matches
#odN(e1 ... em) or #N(e1 ... em) | #od5(dog cat) or #5(dog cat) | All occurrences of dog and cat appearing ordered within a window of 5 words
#uwN(e1 ... em) | #uw5(dog cat) | All occurrences of dog and cat that appear in any order within a window of 5 words
#phrase(e1 ... em) | #phrase(#1(willy wonka) #uw3(chocolate factory)) | System-dependent implementation (defaults to #odm)
#syntax:xx(e1 ... em) | #syntax:np(fresh powder) | System-dependent implementation
27
Context Restriction
Example | Matches
dog.title | All occurrences of dog appearing in the title context
dog.title,paragraph | All occurrences of dog appearing in both a title and a paragraph context (may not be possible)
<dog.title dog.paragraph> | All occurrences of dog appearing in either a title context or a paragraph context
#5(dog cat).head | All matching windows contained within a head context
28
Context Evaluation
Example | Evaluated
dog.(title) | The term dog evaluated using the title context as the document
dog.(title, paragraph) | The term dog evaluated using the concatenation of the title and paragraph contexts as the document
dog.figure(paragraph) | The term dog restricted to figure tags within the paragraph context
29
Belief Operators
INQUERY | INDRI
#sum / #and | #combine
#wsum | #weight
#or | #or
#not | #not
#max | #max

#wsum is still available in INDRI, but should
be used with discretion
30
Extent Retrieval
Example | Evaluated
#combine[section](dog canine) | Evaluates #combine(dog canine) for each extent associated with the section context
#combine[title, section](dog canine) | Same as previous, except it is evaluated for each extent associated with either the title context or the section context
#sum(#sum[section](dog)) | Returns a single score that is the sum of the scores returned from #sum(dog) evaluated for each section extent
#max(#sum[section](dog)) | Same as previous, except returns the maximum score
31
Extent Retrieval Example
Query: #combine[section]( dirichlet smoothing )

<document>
  <section><head>Introduction</head>
  Statistical language modeling allows formal methods
  to be applied to information retrieval. ...
  </section>
  <section><head>Multinomial Model</head>
  Here we provide a quick review of multinomial
  language models. ...
  </section>
  <section><head>Multiple-Bernoulli Model</head>
  We now examine two formal methods for statistically
  modeling documents and queries based on the
  multiple-Bernoulli distribution. ...
  </section>
</document>
  1. Treat each section extent as a document
  2. Score each document according to #combine( dirichlet smoothing )
  3. Return a ranked list of extents.

[Figure: the document above with per-section scores 0.15, 0.50, 0.05
attached to its three section extents]

SCORE   DOCID    BEGIN   END
0.50    IR-352    51     205
0.35    IR-352   405     548
0.15    IR-352     0      50
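
A toy sketch of the three steps above: treat each section extent as a
mini-document, score it with query log-likelihood over smoothed term
estimates, and return a ranked (score, docid, begin, end) list. The
scoring function and data layout are illustrative only, not Indri's
implementation:

    from math import log

    def score_extent(query_terms, extent_tokens, collection_prob, mu=2500):
        """Query log-likelihood of one extent, with Dirichlet-smoothed
        term estimates (illustrative sketch)."""
        n = len(extent_tokens)
        return sum(log((extent_tokens.count(t) + mu * collection_prob[t])
                       / (n + mu))
                   for t in query_terms)

    def rank_extents(query_terms, sections, collection_prob):
        """sections: list of (docid, begin, end, tokens), one per extent."""
        ranked = [(score_extent(query_terms, toks, collection_prob), docid, b, e)
                  for docid, b, e, toks in sections]
        return sorted(ranked, reverse=True)

    sections = [
        ("IR-352", 0, 50,
         "statistical language modeling allows formal methods".split()),
        ("IR-352", 51, 205,
         "quick review of multinomial language models dirichlet smoothing".split()),
    ]
    cprob = {"dirichlet": 1e-4, "smoothing": 1e-3}
    print(rank_extents(["dirichlet", "smoothing"], sections, cprob))
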
32
Indri Examples
  • Where was George Washington born?
  • #combine[sentence]( #1( george washington )
    born #any:place )
  • Paragraphs from news feed articles published
    between 1991 and 2000 that mention a person, a
    monetary amount, and the company InfoCom
  • #filreq( #band( NewsFeed.doctype
    #date:between( 1991 2000 ) )
    #combine[paragraph]( #any:person
    #any:money InfoCom ) )

33
Example Indri Web Query
#weight( 0.1 #weight( 1.0 #prior(pagerank)
                      0.75 #prior(inlinks) )
         1.0 #weight( 0.9 #combine(
                #wsum( 1 stellwagen.(inlink)
                       1 stellwagen.(title)
                       3 stellwagen.(mainbody)
                       1 stellwagen.(heading) )
                #wsum( 1 bank.(inlink)
                       1 bank.(title)
                       3 bank.(mainbody)
                       1 bank.(heading) ) )
              0.1 #combine(
                #wsum( 1 #uw8( stellwagen bank ).(inlink)
                       1 #uw8( stellwagen bank ).(title)
                       3 #uw8( stellwagen bank ).(mainbody)
                       1 #uw8( stellwagen bank ).(heading) ) ) ) )
34
Examples of Using IR for Structured Data
  • XML search
  • Relevance models for incomplete data
  • Extracted entity retrieval

35
XML Search
  • INEX workshop is similar to TREC but focused on
    XML documents
  • Queries contain varying degrees of structural
    specification
  • Not clear that these queries are realistic
  • an earlier study showed that people are not good
    at remembering structure
  • document structure can provide valuable evidence
    for content representation

36
Example INEX Query
37
Hierarchical Language Models
  • Estimate a language model for each component of a
    document tree (Ogilvie 2004, 2005)
  • Smooth using a weighted mixture of a background
    model, a document model, a parent model, and a
    mixture of the children models
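
A sketch of that mixture for one node of the document tree. The component
estimates are passed in precomputed; the weights, names, and the inclusion
of the node's own maximum-likelihood estimate are illustrative choices,
not Ogilvie's exact parameterization:

    def node_language_model(term_mle, children_mles, parent_mle,
                            doc_mle, collection_prob,
                            lambdas=(0.3, 0.2, 0.2, 0.2, 0.1)):
        """P(term | node): weighted mixture of the node's own estimate,
        a uniform mixture of its children's models, its parent's model,
        the whole-document model, and the background (collection) model.
        Sketch only; all probabilities are for a single fixed term."""
        l_self, l_kids, l_par, l_doc, l_bg = lambdas
        kids = (sum(children_mles) / len(children_mles)
                if children_mles else 0.0)
        return (l_self * term_mle + l_kids * kids + l_par * parent_mle
                + l_doc * doc_mle + l_bg * collection_prob)

    # Example: P("retrieval" | a section node)
    p = node_language_model(term_mle=0.05, children_mles=[0.02, 0.08],
                            parent_mle=0.03, doc_mle=0.04,
                            collection_prob=0.001)
    print(round(p, 4))   # 0.0391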

38
Hierarchical Language Models
39
Does it work?
Results from Ogilvie, 2003
40
Does it work?
Results from Ogilvie, 2003
41
Indri INEX extensions
  • Indri incorporates hierarchical language models
  • Allows weights to be set for different language
    models and component type
  • Query language extended to reference parent and
    child extents
  • use the .\field operator to access a child
    reference
  • use the ./field operator to access a parent
    reference
  • use the .//field operator to access an ancestor
    reference
  • e.g. #combine[section]( bootstrap
    #combine[./title]( methodology ) )

42
Relevance Models for Incomplete Data
  • Relevance models (Lavrenko, 2001) are used for
    query expansion in IR based on generative LMs
  • Estimates dependencies between words based on
    training set or initial ranking
  • Recently extended to semi-structured data for
    applications where records are missing data
    (Lavrenko, Yi, Allan, 2006)
  • e.g. NSDL collection with fields title,
    description, subject, content, audience
  • 24% of 650,000 records have no subject field, 30%
    no author, 96% no audience
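
For reference, the underlying relevance-model estimate (Lavrenko and
Croft, 2001) from an initial ranking, in sketch form; in the
semi-structured extension a separate relevance model is estimated per
field, as the next slides describe. Names and smoothing choices here are
illustrative:

    def relevance_model(query_terms, top_docs):
        """RM1-style estimate: P(w|R) is proportional to the sum over
        top-ranked documents D of P(w|D) * prod_q P(q|D).
        top_docs: list of dicts mapping word -> smoothed P(word|D).
        Sketch only; assumes the per-document models are smoothed so
        query terms have non-zero probability."""
        rm = {}
        for p_doc in top_docs:
            q_lik = 1.0
            for q in query_terms:
                q_lik *= p_doc.get(q, 1e-10)
            for w, p in p_doc.items():
                rm[w] = rm.get(w, 0.0) + p * q_lik
        z = sum(rm.values())
        return {w: p / z for w, p in rm.items()}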

43
Relevance Models for Incomplete Data
  • Basic process is to estimate relevance models for
    each field based on training data for a query,
    then rank test records based on comparison to
    relevance models
  • Relevance model estimates how likely it is that a
    word occurs in a field of a record, given that a
    record matches the specified query fields
  • Ranking is done using a weighted cross-entropy
  • weights reflect importance of field
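
A sketch of that ranking step: a record's score is a weighted sum over
fields of the (negative) cross-entropy between the field's relevance
model and the record's smoothed field language model. The data layout,
names, and handling of unseen words are illustrative, not Lavrenko, Yi
and Allan's exact formulation:

    from math import log

    def cross_entropy_score(relevance_models, record_models, field_weights):
        """relevance_models / record_models: {field: {word: prob}};
        field_weights: {field: weight} reflecting field importance.
        Higher is better (we sum P_RM(w) * log P_rec(w), i.e. negative
        cross-entropy). Record models are assumed pre-smoothed; words
        missing from the record model are simply skipped in this sketch."""
        score = 0.0
        for field, weight in field_weights.items():
            rm = relevance_models.get(field, {})
            rec = record_models.get(field, {})
            score += weight * sum(p * log(rec[w])
                                  for w, p in rm.items() if w in rec)
        return score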

44
Relevance Models for Incomplete Data
  • In the NSDL experiment, 127 queries of the form
  • subject=philosophy AND audience=high school
  • In test collection, all records had subject and
    audience field values removed
  • Retrieved records had a precision of 30% in the top
    10, compared to 15% for a baseline that ranked text
    records containing all fields
  • Shows potential of probabilistic models for this
    type of application
  • can also generate structured queries (Calado et
    al, CIKM 02)

45
Extracted Entity Retrieval
  • Information extraction extracts structure from
    text
  • e.g. names, addresses, email addresses, CVs,
    publications, tables
  • Creates semi-structured (and noisy) data rather
    than databases
  • Table extraction can be the basis for question
    answering (Wei, Croft and McCallum, 2006)
  • Publication extraction is the basis of
    CITESEER-like systems (e.g. REXA, McCallum, 2005)
  • Person extraction can be the basis for expert
    finding

46
Expert Finding
  • Evaluated in TREC Enterprise Track
  • People are represented by text that co-occurs
    with names
  • which names? what text?
  • People are ranked for a query using the text
    profile
  • Relevance model approach is effective
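
A toy sketch of that profile-based approach: gather the words that
co-occur with each person's name within a fixed window into a profile,
then rank people by query log-likelihood against their smoothed profile.
Window size, smoothing, and all names are illustrative, and this uses
plain query likelihood rather than the relevance-model variant mentioned
above:

    from collections import Counter, defaultdict
    from math import log

    def build_profiles(docs, people, window=10):
        """Profile = bag of words appearing within `window` tokens of a
        person's name (people given as lowercase single tokens here)."""
        profiles = defaultdict(Counter)
        for doc in docs:
            tokens = doc.lower().split()
            for i, tok in enumerate(tokens):
                if tok in people:
                    profiles[tok].update(tokens[max(0, i - window):i + window + 1])
        return profiles

    def rank_people(query, profiles, mu=50):
        background = Counter()
        for prof in profiles.values():
            background.update(prof)
        total = sum(background.values())
        scored = []
        for person, prof in profiles.items():
            plen = sum(prof.values())
            score = 0.0
            for t in query.lower().split():
                p_bg = (background[t] + 0.5) / (total + 1.0)  # crude background smoothing
                score += log((prof[t] + mu * p_bg) / (plen + mu))
            scored.append((score, person))
        return sorted(scored, reverse=True)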

47
Conclusion
  • For many applications involving retrieval of
    semi-structured data, the right approach is an IR
    system based on a probabilistic retrieval model
    as a front-end, and a database system as the
    back-end
  • but the IR system is not implemented using the
    database system
  • "Right" means it gives effective results and
    supports the user's world view
  • IR systems based on language models (e.g. Indri)
    are a good candidate