Title: Indexing Strategies for the Linguist's Search Engine
1Indexing Strategies for the Linguist's Search Engine
- Aaron Elkiss and Philip Resnik
- UMIACS
2Why a Linguist's Search Engine?
- Goal for linguists: use naturally occurring data to support theories
- Bag-of-words searches are not sufficient
- Structural searches of parse trees would be better
3Constituency Parse
4A Web Search Tool for the Ordinary Working
Linguist
- Database
  - Must permit real-time interaction
  - Must permit large-scale searches
  - Must allow search on linguistic criteria
- Interface
  - Must have linguist-friendly look and feel
  - Must minimize learning/ramp-up time
  - Must be reliable
  - Must evolve with real use
5Querying Parse Trees
- Find all trees containing a particular subtree
- We use Query by Example: edit an example sentence into the structure we're interested in
6Query Properties
- Typically concerned with structure near the leaves of the tree
- Relationship can be ancestorship rather than immediate dominance
7LSE Design Criteria
- Must permit arbitrary structural searches
  - multiple branches with wildcards
- in real time
- on a large collection of sentences
  - 1 GB, scaling up to 10 GB or more
8Existing Techniques
- Convert data to a relational model
- Streaming techniques (tgrep2 (Rohde), XSQ (Chawathe et al.))
- Index, but permit only simple searches (DataGuides, Widom et al.)
- Indexing techniques work best with a simple schema
9Goals
- Must handle a dataset with a very large schema
  - 17 million paths from root to terminal
  - XMark 1 GB has 2.4 million
  - Path lengths also longer in LSE
  - Set of paths from root to preterminal is fixed in XMark, grows without bound in LSE
- Must handle queries with wildcards well
- Must retrieve all results (100% recall)
10Assumptions
- Indexing can be slow (overnight)
- Doesn't need to support online update
- Can overgenerate results
  - < 100% precision
  - Use tgrep2 as a filter
11Baseline Solution
- VIST: A dynamic index method for querying XML data by tree structures (Wang et al. (IBM Watson), SIGMOD 2003)
- Suffix-tree based approach
- Indexes structure and content together
- Supports branching queries well
12Suffix Trees
- Index all suffixes of a given string
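As a rough illustration of indexing all suffixes, the sketch below builds a naive, uncompressed suffix trie out of nested dictionaries. It is not a real suffix tree (which compresses chains of single children and can be built in linear time); the function names are made up for this example.

```python
def suffixes(text):
    """Return every suffix of text, longest first."""
    return [text[i:] for i in range(len(text))]


def build_suffix_trie(text):
    """Insert every suffix into a trie of nested dicts (quadratic space)."""
    root = {}
    for suffix in suffixes(text):
        node = root
        for ch in suffix:
            node = node.setdefault(ch, {})
    return root


def contains(trie, pattern):
    """A pattern occurs in the text iff it is a prefix of some suffix."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("banana")
found_nan = contains(trie, "nan")   # substring of "banana"
found_nab = contains(trie, "nab")   # not a substring
```

Because every substring is a prefix of some suffix, one walk from the root answers any substring query.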
13Structure Encoded Sequences
- Represent each node in DFS order with the complete path from the root to the node
- One parse tree = one document = one structure-encoded sequence
S1 S_S1 NP_S_S1 NNP_NP_S_S1 Jared_NNP_NP_S_S1 VP_S_S1 VBD_VP_S_S1 laughed_VBD_VP_S_S1
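The encoding can be sketched in a few lines. The tree representation used here (a `(label, children)` tuple) and the function name `encode` are assumptions for illustration; only the output format, node label first and root last, follows the example on the slide.

```python
def encode(tree, ancestors=()):
    """Return the structure-encoded sequence of a (label, children) tree.

    Nodes are emitted in depth-first order; each element is the node's
    label followed by its ancestors up to the root, joined with "_".
    """
    label, children = tree
    path = (label,) + ancestors
    sequence = ["_".join(path)]
    for child in children:
        sequence.extend(encode(child, path))
    return sequence


# Parse tree for "Jared laughed".
parse = ("S1", [("S", [
    ("NP", [("NNP", [("Jared", [])])]),
    ("VP", [("VBD", [("laughed", [])])]),
])])

sequence = encode(parse)
```

Each element carries its whole path, so the sequence alone is enough to rebuild the tree.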
14VIST Trees
- Insert structure encoded sequences instead of
suffixes of a string
15Node Identification
- (DFS order / node ID, number of descendants): (n, d)
- DFS order uniquely identifies a node
- with number of descendants, identifies which nodes are descendants of a given node
- can produce without using a lot of memory using perl and UNIX sort utility
16VIST Indexes
- Two B-tree indexes using BerkeleyDB
  - Structural Sequence Index
  - Document Index
17Structural Sequence Index
- Structural sequence element → (n, d)
- S1 → (0,12)
- VP_S_S1 → (5,2), (10,2)
18Document Index
- documents inserted at node ID of last element
[Figure: document index entries at node IDs 7 and 12]
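The two indexes can be sketched with plain dictionaries standing in for the BerkeleyDB B-trees. The sketch inserts each structure-encoded sequence into a trie, numbers the trie nodes in DFS order, and records (n, d) per element plus the documents ending at each node. All names are hypothetical, and the second document ("Jared slept") is invented for the example.

```python
from collections import defaultdict


def build_vist(docs):
    """docs: {doc_id: [structure-encoded elements]} -> (structural, documents)."""
    root = {"children": {}, "docs": []}
    for doc_id, sequence in docs.items():
        node = root
        for element in sequence:
            node = node["children"].setdefault(
                element, {"children": {}, "docs": []})
        node["docs"].append(doc_id)    # document sits at its last element

    structural = defaultdict(list)     # element -> [(n, d), ...]
    documents = defaultdict(list)      # node ID -> [doc IDs]
    counter = 0

    def visit(element, node):
        nonlocal counter
        n = counter
        counter += 1
        for child_element, child in node["children"].items():
            visit(child_element, child)
        structural[element].append((n, counter - n - 1))
        if node["docs"]:
            documents[n].extend(node["docs"])

    for element, node in root["children"].items():
        visit(element, node)
    for entries in structural.values():
        entries.sort()                 # keep each list ordered by node ID
    return dict(structural), dict(documents)


shared = ["S1", "S_S1", "NP_S_S1", "NNP_NP_S_S1", "Jared_NNP_NP_S_S1",
          "VP_S_S1", "VBD_VP_S_S1"]
docs = {1: shared + ["laughed_VBD_VP_S_S1"],
        2: shared + ["slept_VBD_VP_S_S1"]}
structural, documents = build_vist(docs)
```

The two sentences share a seven-element prefix, so they share trie nodes 0 to 6 and diverge only at the final terminal.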
19Search
[Figure: example query]
- Order of branches in query is important
20Recursion Base Case
- After the last branch of the query
- Retrieve documents with descendant node IDs
[Figure: document index entry at node ID 7]
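The recursion and its base case might look roughly like the sketch below. It assumes a fully instantiated query (no wildcards) given as a list of structure-encoded elements, and it scans small hand-written dictionaries linearly where a real implementation would do B-tree range scans. The index contents are invented for the example.

```python
def search(query, structural, documents, scope=None):
    """Return IDs of documents whose sequence contains query as a subsequence."""
    if not query:
        # Base case: past the last query element, collect every document
        # stored at the matched node or at one of its descendants.
        if scope is None:
            return set()
        n, d = scope
        return {doc for node_id, doc_ids in documents.items()
                if n <= node_id <= n + d for doc in doc_ids}
    found = set()
    for n, d in structural.get(query[0], []):
        # The next element must lie strictly below the previous match.
        in_scope = scope is None or scope[0] < n <= scope[0] + scope[1]
        if in_scope:
            found |= search(query[1:], structural, documents, (n, d))
    return found


# Hypothetical index for two documents, "Jared laughed" (1) and "Jared slept" (2).
structural = {
    "S1": [(0, 8)], "S_S1": [(1, 7)], "NP_S_S1": [(2, 6)],
    "NNP_NP_S_S1": [(3, 5)], "Jared_NNP_NP_S_S1": [(4, 4)],
    "VP_S_S1": [(5, 3)], "VBD_VP_S_S1": [(6, 2)],
    "laughed_VBD_VP_S_S1": [(7, 0)], "slept_VBD_VP_S_S1": [(8, 0)],
}
documents = {7: [1], 8: [2]}

laughed = search(["S1", "VP_S_S1", "laughed_VBD_VP_S_S1"], structural, documents)
any_vp = search(["S1", "VP_S_S1"], structural, documents)
```

Because matching is subsequence matching over the encoded sequence, the order in which query branches are listed changes what can match.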
21Peculiarities of VIST
- Precision is not 100%!
- Query [figure] matches both of these documents [figure]
22Problematic Query - Wildcards
- Wildcards can still be a problem
- Recursion isn't deep but can be very wide
- End up looking at same nodes over and over again
with different wildcard instantiations from
previous branches
23Problematic Query - Wildcards
24Problematic Query - Common Terminal
- VIST's structural index actually stores
  - terminal, length, path from root to preterminal
  - the 6 S1 S VP FRAG X DT
- to find instantiated prefixes of structural sequence elements
- We'd look for
  - JJR 5 S1 S VP FRAG X
25Problematic Query - Common Terminal
- To find structural sequence elements like the_DT_X_FRAG_ we have to look at every element with the terminal "the"
- 220284 for the_ vs. 121 for the_DT_X_FRAG_
26Solution Overview
- Ignore insufficiently selective query branches
- Reorder processing of query branches
- Different ordering for structural index
- Create in-memory tree for the query
- Memoization of nodes matching subtree of query
27Ignore query branches
- Generate statistics for each pair of tokens
- Calculate estimated selectivity of each branch
- Discard insufficiently selective branches
- Use tgrep2 as filter
Still problematic
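One way the selectivity estimate could work is sketched below. The slides do not say how the pair statistics are combined, so the rule used here (a branch is estimated by its rarest adjacent token pair) is purely an assumption, as are the function names, the threshold, and the made-up counts.

```python
def estimate_branch(branch, pair_counts):
    """Estimate matches for a branch as the count of its rarest adjacent pair.

    Unseen pairs count as 0; a one-token branch has no pairs and is treated
    as unselective (infinite estimate).
    """
    pairs = zip(branch, branch[1:])
    return min((pair_counts.get(pair, 0) for pair in pairs),
               default=float("inf"))


def select_branches(branches, pair_counts, max_matches):
    """Keep branches estimated at or below max_matches, most selective first.

    If every branch is too common, keep the single best one so the index
    is still consulted; the tgrep2 filter handles the dropped branches.
    """
    ranked = sorted(branches, key=lambda b: estimate_branch(b, pair_counts))
    kept = [b for b in ranked
            if estimate_branch(b, pair_counts) <= max_matches]
    return kept or ranked[:1]


# Illustrative counts only; not taken from the LSE data.
pair_counts = {
    ("S1", "NP"): 900000, ("NP", "robot"): 40,
    ("S1", "VP"): 800000, ("VP", "laughs"): 300,
    ("S1", "DT"): 950000, ("DT", "the"): 200000,
}
branches = [("S1", "DT", "the"), ("S1", "VP", "laughs"), ("S1", "NP", "robot")]
kept = select_branches(branches, pair_counts, max_matches=1000)
```

Dropping a branch can only add false positives, never lose results, which fits the assumption that precision may fall below 100% while recall stays at 100%.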
28Reorder query branches
- Start processing with most selective branch
- Join to preceding branches, then following branches
29Reorder structural index
- Store as
  - terminal, path from preterminal to root
  - the DT X FRAG VP S S1
- Immediately find paths with particular suffix
- Terminals occurring in similar contexts are
clustered together
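The effect of the reversed key order can be sketched with a sorted list standing in for the B-tree. With the terminal first and the path running from preterminal up to root, all keys sharing a path suffix such as the_DT_X_FRAG_ become one contiguous range, found by binary search. The keys and the helper name `prefix_scan` are assumptions for illustration.

```python
from bisect import bisect_left


def prefix_scan(sorted_keys, prefix):
    """Return every key tuple that starts with prefix.

    sorted_keys must be sorted; the matches form one contiguous run
    beginning at the position found by binary search.
    """
    start = bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if key[:len(prefix)] != prefix:
            break                      # left the contiguous range
        matches.append(key)
    return matches


# Keys stored as (terminal, preterminal, ..., root).
keys = sorted([
    ("the", "DT", "X", "FRAG", "VP", "S", "S1"),
    ("the", "DT", "NP", "S", "S1"),
    ("the", "DT", "X", "FRAG", "S1"),
    ("JJR", "X", "FRAG", "VP", "S", "S1"),
])

the_dt_x_frag = prefix_scan(keys, ("the", "DT", "X", "FRAG"))
```

Under the original ordering (terminal, length, root to preterminal) the same lookup would have to visit every key whose terminal is "the".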
30Reorder structural index
- Now we have to look at every JJR_X_FRAG_ instead of just those with the same prefix as the_DT_X_FRAG_
- But we'll only do so once, and only keep those the_DT_X_FRAG_ and JJR_X_FRAG_ that have matching prefixes
31Create Query Tree
- Keep relevant instantiations of each branch in
memory
[Figure: in-memory query tree]
S1__NP__robot
  robot_NN_NP_NP_S_SBAR_S_X_X_S1
  robot_NN_NP_NP_S_SBAR_VP_FRAG_S1
  robot_NN_NP_NP_S_SBAR_VP_S_S_S1
S1__VP
  VP_S_S1
    _laughs: laughs_VBZ_VP_VP_S_SBAR_NP_PP_NP_PP
    _us: us_PRP_NP
  VP_VP_S_SBAR_NP_PP_NP_PP_VP_S_S1
    _laughs: laughs_VBZ
    _us: us_PRP_NP
32Subtree Memoization
- Create sorted list of all nodes for a particular
branch of the query
[Figure: query tree annotated with node lists]
S1__NP__robot
  robot_NN_NP_NP_S_SBAR_S_X_X_S1 (1,15) (30,10)
S1__VP
  VP_S_S1
    _laughs: laughs_VBZ_VP_VP_S_SBAR_NP_PP_NP_PP (5,5)
  VP_VP_S_SBAR_NP_PP_NP_PP_VP_S_S1
    _laughs: laughs_VBZ (20,0)
Memoized list: S1__VP__laughs → (5,5) (20,0)
33Subtree Memoization
- Specifier for memoized list includes wildcard
instantiations
[Figure: memoized lists keyed by wildcard instantiation]
S1__VP
  VP_S_S1
    _laughs: laughs_VBZ_VP_VP_S_SBAR_NP_PP_NP_PP (5,5) (10,0)
    _us: us_PRP_NP (6,0), us_PRP_NP_NP (50,0)
  VP_VP_S_SBAR_NP_PP_NP_PP_VP_S_S1
    _laughs: laughs_VBZ (20,20)
    _us: us_PRP_NP (60,0)
Memoized lists:
S1__VP__us / VP_S_S1 → (6,0) (50,0)
S1__VP__us / VP_VP_S_SBAR_NP_PP_NP_PP_VP_S_S1 → (60,0)
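The memoization idea can be sketched as a cache keyed by the query subtree together with its wildcard instantiation. The class name, the `lookup` callback, and the hard-coded table are hypothetical; the node lists are the ones shown in the figure above.

```python
class SubtreeMemo:
    """Cache of sorted (n, d) lists keyed by (query subtree, instantiation)."""

    def __init__(self, lookup):
        self._lookup = lookup          # function doing the expensive index work
        self._cache = {}
        self.lookups = 0               # how often the index was really consulted

    def nodes(self, subtree, instantiation):
        key = (subtree, instantiation)
        if key not in self._cache:
            self.lookups += 1
            self._cache[key] = sorted(self._lookup(subtree, instantiation))
        return self._cache[key]


# Stand-in for the real index lookup.
TABLE = {
    ("S1__VP__us", "VP_S_S1"): [(50, 0), (6, 0)],
    ("S1__VP__us", "VP_VP_S_SBAR_NP_PP_NP_PP_VP_S_S1"): [(60, 0)],
}


def fake_index_lookup(subtree, instantiation):
    return TABLE.get((subtree, instantiation), [])


memo = SubtreeMemo(fake_index_lookup)
first = memo.nodes("S1__VP__us", "VP_S_S1")
second = memo.nodes("S1__VP__us", "VP_S_S1")   # served from the cache
```

Including the instantiation in the key keeps node lists for different wildcard bindings apart, so a cached list is reused only when the same binding recurs.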
34Evaluation
- Original VIST scalability
- XMark
- LSE data
35Original VIST scalability
Random queries over a synthetic data set
From Haixun Wang, Sanghyun Park, Wei Fan, and Philip S. Yu. VIST: A dynamic index method for querying XML data by tree structures. In SIGMOD, 2003. http://citeseer.nj.nec.com/wang03vist.html
36Evaluation - VIST
- Scales extremely well for XMark
- qn vs. qnc: cached vs. non-cached
- Queries: same form as XPath queries from original VIST paper
- Q1: /site//item[location='US']/mail/date[text='12/15/1999'] (3.7s)
- Q2: /site//person/*/city[text='Pocatello'] (2.5s)
- Q3: //closed_auction[*[person='person1']]/date[text='12/15/1999'] (4.1s)
37Evaluation - LSE
- Queries: two forms of a real LSE query
[Figures: Q1 and Q2]
38Evaluation - Index Size
39Future Directions
- Reimplement this original VIST in C
- Scale up to 10 GB
- Improved query planning
- Ranking: efficient top-k results
- Investigate usefulness for structural search of
HTML documents
40HTML Structural Search
- Similar properties to LSE data
  - no fixed schema
  - no maximum path depth
- Whole Web search probably not yet feasible
41Ranking - efficient top-k results
- Assign score to possible result
- Closer to matrix level: higher score?
- Look for results with highest score first
42Improved Query Planning
- Dynamic Ignorance
  - choose whether to use a query branch based on wildcard instantiations
- Full reordering of query branches
43Acknowledgments
- Philip Resnik, of course!
- Saurabh Khandelwal: tree editor
- Doug Rohde: tgrep2
- This work is supported by NSF ITR grant IIS0113641.