Indexing Strategies for the Linguist's Search Engine

Transcript and Presenter's Notes

Title: Indexing Strategies for the Linguist's Search Engine


1
Indexing Strategies for the Linguist's Search
Engine
  • Aaron Elkiss and Philip Resnik
  • UMIACS

2
Why a Linguist's Search Engine?
  • Goal for linguists: use naturally occurring data
    to support theories
  • Bag-of-words searches are not sufficient
  • Structural searches of parse trees would be better

3
Constituency Parse
4
A Web Search Tool for the Ordinary Working
Linguist
  • Database
    • Must permit real-time interaction
    • Must permit large-scale searches
    • Must allow search on linguistic criteria
  • Interface
    • Must have linguist-friendly look and feel
    • Must minimize learning/ramp-up time
    • Must be reliable
    • Must evolve with real use

5
Querying Parse Trees
  • Find all trees containing a particular subtree
  • We use Query by Example: edit an example
    sentence into the structure we're interested in

6
Query Properties
  • Typically concerned with structure near the
    leaves of the tree
  • Relationship can be ancestorship rather than
    immediate dominance
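
The difference between immediate dominance and ancestorship can be illustrated with a small sketch. It assumes a parse tree stored as nested tuples, which is a representation chosen here for illustration and not the LSE's own; the helper names are hypothetical.

```python
# Parse tree as nested tuples: (label, child, ...); a word is a plain string.
tree = ("S1", ("S", ("NP", ("NNP", "Jared")), ("VP", ("VBD", "laughed"))))

def label(node):
    return node if isinstance(node, str) else node[0]

def children(node):
    return [] if isinstance(node, str) else list(node[1:])

def descendants(node):
    """Yield every node strictly below `node`."""
    for child in children(node):
        yield child
        yield from descendants(child)

def dominates(tree, upper, lower, immediate):
    """True if some `upper` node has a `lower` node as a child (immediate)
    or anywhere beneath it (ancestorship)."""
    for node in [tree, *descendants(tree)]:
        if label(node) == upper:
            below = children(node) if immediate else descendants(node)
            if any(label(x) == lower for x in below):
                return True
    return False

print(dominates(tree, "S", "VBD", immediate=True))   # S is not VBD's parent
print(dominates(tree, "S", "VBD", immediate=False))  # but S is an ancestor of VBD
```

A query language that only offers immediate dominance would miss the second case, which is why the relationship in a query may need to be ancestorship.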

7
LSE Design Criteria
  • Must permit arbitrary structural searches
    • multiple branches with wildcards
    • in real time
    • on a large collection of sentences
    • 1 GB, scaling up to 10 GB or more

8
Existing Techniques
  • Convert data to a relational model
  • Streaming techniques (tgrep2 (Rohde), XSQ
    (Chawathe et al.))
  • Index, but permit only simple searches
    (DataGuides, Widom et al.)
  • Indexing techniques work best with a simple schema

9
Goals
  • Must handle a dataset with a very large schema
    • 17 million paths from root to terminal
    • XMark 1 GB has 2.4 million
    • Path lengths are also longer in LSE
    • Set of paths from root to preterminal is fixed in
      XMark, grows without bound in LSE
  • Must handle queries with wildcards well
  • Must retrieve all results (100% recall)

10
Assumptions
  • Indexing can be slow (overnight)
  • Doesn't need to support online update
  • Can overgenerate results
    • < 100% precision
    • Use tgrep2 as a filter

11
Baseline Solution
  • VIST: A dynamic index method for querying XML
    data by tree structures (Wang et al. (IBM Watson),
    SIGMOD 2003)
  • Suffix-tree based approach
  • Indexes structure and content together
  • Supports branching queries well

12
Suffix Trees
  • Index all suffixes of a given string
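
As a rough illustration of why indexing all suffixes helps, the sketch below uses a sorted list of suffixes (a naive suffix array) in place of a real suffix tree. That substitution is an assumption made to keep the example short; the function names are hypothetical.

```python
import bisect

def sorted_suffixes(text):
    """Every suffix of `text`, in sorted order (a naive suffix array)."""
    return sorted(text[i:] for i in range(len(text)))

def contains(suffixes, pattern):
    """A pattern occurs in the text iff it is a prefix of some suffix.
    Binary search finds the first suffix >= pattern; only that one needs checking."""
    i = bisect.bisect_left(suffixes, pattern)
    return i < len(suffixes) and suffixes[i].startswith(pattern)

suffixes = sorted_suffixes("banana")
print(suffixes)
print(contains(suffixes, "nan"), contains(suffixes, "nab"))
```

Any substring search becomes a prefix search over the indexed suffixes, which is the property the suffix-tree approach relies on.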

13
Structure Encoded Sequences
  • Represent each node in DFS order with the
    complete path from the root to the node
  • One parse tree = one document = one structure
    encoded sequence

S1 S_S1 NP_S_S1 NNP_NP_S_S1
Jared_NNP_NP_S_S1 VP_S_S1 VBD_VP_S_S1
laughed_VBD_VP_S_S1
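
A minimal sketch of the encoding, assuming the parse tree is held as nested tuples (an illustrative representation, not necessarily the one used in the LSE). Each node is emitted in DFS order as its own label followed by the labels of its ancestors up to the root.

```python
# Parse tree as nested tuples: (label, child, ...); a word is a plain string.
tree = ("S1", ("S", ("NP", ("NNP", "Jared")), ("VP", ("VBD", "laughed"))))

def structure_encode(node, ancestors=()):
    """Return the structure-encoded sequence for `node` and everything below it.
    `ancestors` holds the labels above the node, nearest ancestor first."""
    label = node if isinstance(node, str) else node[0]
    elements = ["_".join((label,) + ancestors)]
    if not isinstance(node, str):
        for child in node[1:]:
            elements.extend(structure_encode(child, (label,) + ancestors))
    return elements

sequence = structure_encode(tree)
print(" ".join(sequence))
```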
14
VIST Trees
  • Insert structure encoded sequences instead of
    suffixes of a string

15
Node Identification
  • (DFS order / node ID, number of descendants) =
    (n, d)
  • DFS order uniquely identifies a node
  • Together with the number of descendants, it identifies
    which nodes are descendants of a given node
  • Can be produced without using a lot of memory, using
    Perl and the UNIX sort utility
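
The (n, d) labeling can be sketched on any tree. The example below assumes a small nested-tuple tree and hypothetical helper names; it is an in-memory illustration, not the Perl and sort pipeline mentioned above. A node m is a descendant of the node labeled (n, d) exactly when n < m <= n + d.

```python
# Tree as nested tuples: (label, child, ...); a leaf is a plain string.
tree = ("S1", ("S", ("NP", ("NNP", "Jared")), ("VP", ("VBD", "laughed"))))

def label_nodes(tree):
    """Return (label, n, d) per node: n = DFS pre-order position,
    d = number of descendants."""
    labels = []

    def visit(node):
        n = len(labels)            # pre-order position of this node
        labels.append(None)        # reserve the slot, fill it after the subtree
        name = node if isinstance(node, str) else node[0]
        for child in ([] if isinstance(node, str) else node[1:]):
            visit(child)
        labels[n] = (name, n, len(labels) - n - 1)

    visit(tree)
    return labels

def is_descendant(m, n, d):
    """True if node ID m lies inside the subtree of the node labeled (n, d)."""
    return n < m <= n + d

labels = label_nodes(tree)
for entry in labels:
    print(entry)
```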

16
VIST Indexes
  • Two B-tree indexes using Berkeley DB
  • Structural Sequence Index
  • Document Index

17
Structural Sequence Index
  • Structural Sequence Element → (n, d)
  • S1 → (0,12)
  • VP_S_S1 → (5,2), (10,2)

18
Document Index
  • documents inserted at node ID of last element

[Figure: document lists stored at node IDs 7 and 12]
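
The two indexes can be sketched together. The example below is a toy version under stated assumptions: plain Python dictionaries stand in for the Berkeley DB B-trees, the two documents are invented, and the names `TrieNode` and `build_indexes` are hypothetical. Sequences are inserted into a trie, trie nodes are labeled (n, d) in DFS order, and each document is recorded at the node where its sequence ends.

```python
from collections import defaultdict

class TrieNode:
    def __init__(self):
        self.children = {}   # structure-encoded element -> TrieNode
        self.docs = []       # documents whose sequence ends at this node

def build_indexes(docs):
    root = TrieNode()        # virtual root, not numbered
    for doc_id, sequence in docs.items():
        node = root
        for element in sequence:
            node = node.children.setdefault(element, TrieNode())
        node.docs.append(doc_id)

    structural = defaultdict(list)   # element -> [(n, d), ...]
    documents = {}                   # node ID of last element -> [doc_id, ...]
    counter = 0

    def visit(element, node):
        nonlocal counter
        n = counter
        counter += 1
        for child_element, child in node.children.items():
            visit(child_element, child)
        structural[element].append((n, counter - n - 1))
        if node.docs:
            documents[n] = list(node.docs)

    for element, node in root.children.items():
        visit(element, node)
    for node_labels in structural.values():
        node_labels.sort()
    return dict(structural), documents

# Two invented documents, already structure-encoded.
docs = {
    "doc1": ["S1", "S_S1", "NP_S_S1", "VP_S_S1"],
    "doc2": ["S1", "S_S1", "VP_S_S1"],
}
structural, documents = build_indexes(docs)
print(structural)
print(documents)
```

Because the two sequences share the prefix S1, S_S1, they share those trie nodes, and VP_S_S1 ends up with two (n, d) entries, one per distinct context.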
19
Search
Query
  • Order of branches in query is important

20
Recursion Base Case
  • After the last branch of the query
  • Retrieve documents with descendant node IDs

[Figure: document list retrieved at node ID 7]
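
The base case amounts to a range lookup in the document index. The sketch below assumes an invented document index held in a dictionary with sorted keys, and it assumes the matched node itself is included in the range; both are illustrative choices.

```python
import bisect

# Invented document index: node ID of a sequence's last element -> documents.
doc_index = {3: ["doc1"], 7: ["doc2", "doc5"], 12: ["doc3"], 20: ["doc4"]}
keys = sorted(doc_index)

def documents_under(n, d):
    """Documents stored at node IDs in [n, n + d], i.e. at the node
    labeled (n, d) or at any of its descendants."""
    lo = bisect.bisect_left(keys, n)
    hi = bisect.bisect_right(keys, n + d)
    found = []
    for key in keys[lo:hi]:
        found.extend(doc_index[key])
    return found

print(documents_under(5, 7))    # node 5 with 7 descendants covers IDs 5..12
print(documents_under(13, 3))   # IDs 13..16 hold no documents
```

With a B-tree, the same lookup is a single range scan over the keys n through n + d.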
21
Peculiarities of VIST
  • Precision is not 100%!
  • The query shown on the slide matches both of the
    documents shown

22
Problematic Query - Wildcards
  • Wildcards can still be a problem
  • Recursion isn't deep but can be very wide
  • End up looking at same nodes over and over again
    with different wildcard instantiations from
    previous branches

23
Problematic Query - Wildcards
24
Problematic Query - Common Terminal
  • VIST's structural index actually stores
    • terminal, length, root ... preterminal
    • the 6 S1 S VP FRAG X DT
  • To find instantiated prefixes of structural
    sequence elements, we'd look for
    • JJR 5 S1 S VP FRAG X

25
Problematic Query - Common Terminal
  • To find structural sequence elements like
    the_DT_X_FRAG_ we have to look at every element
    with the terminal "the"
  • 220,284 elements for the_ vs. 121 for the_DT_X_FRAG_

26
Solution Overview
  • Ignore insufficiently selective query branches
  • Reorder processing of query branches
  • Different ordering for structural index
  • Create in-memory tree for the query
  • Memoization of nodes matching subtree of query

27
Ignore query branches
  • Generate statistics for each pair of tokens
  • Calculate estimated selectivity of each branch
  • Discard insufficiently selective branches
  • Use tgrep2 as filter
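
One way these steps might look is sketched below. The slides do not say how the pair statistics are combined, so the estimate used here (a branch is bounded by its rarest adjacent token pair) is an assumption, as are the counts, the threshold, and the function names.

```python
# Invented statistics: number of nodes exhibiting each (upper, lower) token pair.
pair_counts = {
    ("S1", "NP"): 950_000,
    ("NP", "the"): 200_000,
    ("S1", "VP"): 900_000,
    ("VP", "laughs"): 40,
}
TOTAL_NODES = 1_000_000
THRESHOLD = 0.1   # assumed cut-off: branches matching more than 10% are ignored

def estimated_selectivity(branch):
    """Estimated fraction of nodes a branch matches (lower = more selective).
    Assumption: the rarest adjacent pair in the branch bounds the estimate."""
    counts = [pair_counts.get(pair, 0) for pair in zip(branch, branch[1:])]
    return min(counts, default=TOTAL_NODES) / TOTAL_NODES

def select_branches(branches):
    """Keep sufficiently selective branches; always keep at least the best one."""
    ranked = sorted(branches, key=estimated_selectivity)
    kept = [b for b in ranked if estimated_selectivity(b) <= THRESHOLD]
    return kept or ranked[:1]

query = [("S1", "NP", "the"), ("S1", "VP", "laughs")]
print(select_branches(query))
```

Dropping a branch only widens the candidate set, so recall is unaffected as long as an exact matcher such as tgrep2 filters the candidates afterwards.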

Still problematic
28
Reorder query branches
  • Start processing with most selective branch
  • Join to preceding branches, then following
    branches
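
A small sketch of the processing order, assuming each branch already has a selectivity estimate (lower means more selective). The slides do not state the order within the preceding branches; walking outward from the pivot is an assumption made here, and `processing_order` is a hypothetical name.

```python
def processing_order(estimates):
    """Return branch positions in processing order: the most selective branch
    first, then the branches before it, then the branches after it."""
    pivot = min(range(len(estimates)), key=estimates.__getitem__)
    preceding = list(range(pivot - 1, -1, -1))     # assumed: nearest to pivot first
    following = list(range(pivot + 1, len(estimates)))
    return [pivot] + preceding + following

# Invented estimates for a four-branch query, in query order.
print(processing_order([0.2, 0.5, 0.001, 0.3]))
```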

29
Reorder structural index
  • Store as
    • terminal, preterminal ... root
    • the DT X FRAG VP S S1
  • Immediately find paths with particular suffix
  • Terminals occurring in similar contexts are
    clustered together
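
The effect of the two key orderings can be sketched with a sorted list standing in for the B-tree. The paths below are invented and `prefix_scan` is a hypothetical helper. With the original ordering, only the terminal can serve as a search prefix; with the reordered keys, the terminal plus its nearest ancestors forms a prefix.

```python
import bisect

# Invented paths: (terminal, labels from preterminal up to the root).
paths = [
    ("the", ["DT", "X", "FRAG", "VP", "S", "S1"]),
    ("the", ["DT", "NP", "S", "S1"]),
    ("the", ["DT", "NP", "VP", "S", "S1"]),
    ("a",   ["DT", "X", "FRAG", "S1"]),
]

# Original ordering: terminal, length, root ... preterminal.
original = sorted(" ".join([t, str(len(p))] + p[::-1]) for t, p in paths)
# Reordered: terminal, preterminal ... root.
reordered = sorted(" ".join([t] + p) for t, p in paths)

def prefix_scan(keys, prefix):
    """Return all keys starting with `prefix` using one binary search."""
    lo = bisect.bisect_left(keys, prefix)
    hi = lo
    while hi < len(keys) and keys[hi].startswith(prefix):
        hi += 1
    return keys[lo:hi]

# Looking for the_DT_X_FRAG_ with anything above FRAG:
print(prefix_scan(original, "the "))             # every "the" entry must be inspected
print(prefix_scan(reordered, "the DT X FRAG "))  # only the matching entries
```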

30
Reorder structural index
  • Now we have to look at every JJR_X_FRAG_ instead
    of just those with the same prefix as
    the_DT_X_FRAG_
  • But we'll only do so once, and keep only those
    the_DT_X_FRAG_ and JJR_X_FRAG_ elements that have
    matching prefixes

31
Create Query Tree
  • Keep relevant instantiations of each branch in
    memory

Query branch S1__NP__robot:
  robot_NN_NP_NP_S_SBAR_S_X_X_S1
  robot_NN_NP_NP_S_SBAR_VP_FRAG_S1
  robot_NN_NP_NP_S_SBAR_VP_S_S_S1
Query branch S1__VP:
  VP_S_S1
    _laughs: laughs_VBZ_VP_VP_S_SBAR_NP_PP_NP_PP
    _us: us_PRP_NP
  VP_VP_S_SBAR_NP_PP_NP_PP_VP_S_S1
    _laughs: laughs_VBZ
    _us: us_PRP_NP
32
Subtree Memoization
  • Create sorted list of all nodes for a particular
    branch of the query

Query branch S1__NP__robot:
  robot_NN_NP_NP_S_SBAR_S_X_X_S1 (1,15) (30,10)
Query branch S1__VP:
  VP_S_S1
    _laughs: laughs_VBZ_VP_VP_S_SBAR_NP_PP_NP_PP (5,5)
  VP_VP_S_SBAR_NP_PP_NP_PP_VP_S_S1
    _laughs: laughs_VBZ (20,0)
Memoized list:
  S1__VP__laughs (5,5) (20,0)
33
Subtree Memoization
  • Specifier for memoized list includes wildcard
    instantiations

Query branch S1__VP:
  VP_S_S1
    _laughs: laughs_VBZ_VP_VP_S_SBAR_NP_PP_NP_PP (5,5) (10,0)
    _us: us_PRP_NP (6,0), us_PRP_NP_NP (50,0)
  VP_VP_S_SBAR_NP_PP_NP_PP_VP_S_S1
    _laughs: laughs_VBZ (20,20)
    _us: us_PRP_NP (60,0)
Memoized lists:
  S1__VP__us / VP_S_S1 (6,0) (50,0)
  S1__VP__us / VP_VP_S_SBAR_NP_PP_NP_PP_VP_S_S1 (60,0)
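
Subtree memoization can be sketched with a cache keyed by the query branch's terminal together with the wildcard instantiation. The table below reuses the labels from this slide, but the dictionary layout and the function name are assumptions made for illustration, and `functools.lru_cache` stands in for whatever cache the real system uses.

```python
from functools import lru_cache

# Wildcard instantiation of S1__VP -> matching elements -> (n, d) labels.
INSTANTIATIONS = {
    "VP_S_S1": {
        "laughs_VBZ_VP_VP_S_SBAR_NP_PP_NP_PP": [(5, 5), (10, 0)],
        "us_PRP_NP": [(6, 0)],
        "us_PRP_NP_NP": [(50, 0)],
    },
    "VP_VP_S_SBAR_NP_PP_NP_PP_VP_S_S1": {
        "laughs_VBZ": [(20, 20)],
        "us_PRP_NP": [(60, 0)],
    },
}

@lru_cache(maxsize=None)
def memoized_nodes(terminal, instantiation):
    """Sorted (n, d) labels of all nodes for `terminal` under one instantiation.
    The cache key includes the instantiation, so each list is built only once."""
    elements = INSTANTIATIONS[instantiation]
    nodes = [nd for element, labels in elements.items()
             if element.startswith(terminal + "_") for nd in labels]
    return tuple(sorted(nodes))

print(memoized_nodes("us", "VP_S_S1"))
print(memoized_nodes("us", "VP_S_S1"))   # served from the cache
print(memoized_nodes("us", "VP_VP_S_SBAR_NP_PP_NP_PP_VP_S_S1"))
print(memoized_nodes.cache_info())
```

Keying on the instantiation keeps lists for different wildcard bindings separate, while repeated visits with the same binding reuse the stored list.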
34
Evaluation
  • Original VIST scalability
  • XMark
  • LSE data

35
Original VIST scalability
Random queries over a synthetic data set
From Haixun Wang, Sanghyun Park, Wei Fan, and
Philip S. Yu. VIST: A dynamic index method for
querying XML data by tree structures. In SIGMOD,
2003. http://citeseer.nj.nec.com/wang03vist.html
36
Evaluation - VIST
  • Scales extremely well for XMark
  • qn vs. qnc: cached vs. non-cached
  • Queries have the same form as the XPath queries
    from the original VIST paper
  • Q1: /site//item[location='US']/mail/date[text='12/15/1999']
    (3.7s)
  • Q2: /site//person//city[text='Pocatello'] (2.5s)
  • Q3: //closed_auction[person='person1']/date[text='12/15/1999']
    (4.1s)

37
Evaluation - LSE
  • Need more data
  • Queries: two forms of a real LSE query

Q1
Q2
38
Evaluation - Index Size
39
Future Directions
  • Reimplement this original VIST in C
  • Scale up to 10 GB
  • Improved query planning
  • Ranking: efficient top-k results
  • Investigate usefulness for structural search of
    HTML documents

40
HTML Structural Search
  • Similar properties to LSE data
  • no fixed schema
  • no maximum path depth
  • Whole Web search probably not yet feasible

41
Ranking: efficient top-k results
  • Assign a score to each possible result
  • Closer to the matrix level, higher score?
  • Look for results with highest score first

42
Improved Query Planning
  • Dynamic Ignorance
  • choose whether to use a query branch based on
    wildcard instantiations
  • Full reordering of query branches

43
Acknowledgments
  • Philip Resnik, of course!
  • Saurabh Khandelwal: tree editor
  • Doug Rohde: tgrep2
  • This work is supported by NSF ITR grant
    IIS0113641.