Title: A Fast Index for Semistructured Data
1A Fast Index for Semistructured Data
- Brian Cooper, Neal Sample, Michael Franklin, Gisli Hjalmtysson, Moshe Shadmon
Slides and Presentation by Ram Nandyala, 26 May 2005
2Overview
- Introduction
- Index Fabric
- Indexing XML with the Index Fabric
- Experimental Results
- Conclusion
3Introduction
- What is semistructured data?
- data with an irregular or changing organization
- does not conform to a particular schema
- How to manage semistructured data?
- RDBMS
- Oracle, etc.
- Specialized data managers
- Lore, Tamino, XYZFind
4Index Fabric
- Encodes XML data paths as strings
- Features
- Does not impose a rigid structure or data model on the information being indexed
- Can be used to represent the complex relationships that often exist among data elements
- Can augment simple search paths with descriptive and semantic information to enhance the richness of queries
5Index Fabric (2)
- PATRICIA trie
- compressed trie: nodes with only one child are removed
- indexes a large number of strings in a compact and efficient structure
- instead of storing entire keys, indexes the differences between keys, hence slow growth
6Index Fabric (3)
- PATRICIA trie
- nodes labeled with depth: the character position in the key represented by the path
- size of trie is independent of key lengths
- aggressive (but lossy) compression
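[Figure: PATRICIA trie over the keys at, california, car, cat. The root is labeled 0 with edges a and c; the node under c is labeled 2 with edges t, l and r.]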
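The PATRICIA trie described above can be sketched in a few lines of Python. This is a minimal illustration, not the Index Fabric implementation: the class and method names (`PatriciaTrie`, `candidate`, `search`) are made up for this example, leaves hold the full key rather than a pointer to a record, and a terminator character is assumed so that no key is a prefix of another. It shows why the compression is lossy: the search inspects only the character positions stored in the nodes, so the key it lands on must be compared with the search key.

```python
END = "\0"  # assumed terminator, so that no key is a prefix of another


class Node:
    def __init__(self, depth):
        self.depth = depth  # character position tested at this node
        self.edges = {}     # one character -> Node, or a leaf (the full key)


class PatriciaTrie:
    def __init__(self):
        self.root = None    # None, a leaf string, or a Node

    def _closest_leaf(self, key):
        """Descend using only the indexed positions; take any edge if none matches."""
        node = self.root
        while isinstance(node, Node):
            ch = key[node.depth] if node.depth < len(key) else END
            if ch in node.edges:
                node = node.edges[ch]
            else:
                node = next(iter(node.edges.values()))
        return node

    def insert(self, key):
        key += END
        if self.root is None:
            self.root = key
            return
        other = self._closest_leaf(key)
        if other == key:
            return
        # First position where the new key differs from the closest stored key.
        d = next(i for i, (a, b) in enumerate(zip(key, other)) if a != b)
        parent, edge, node = None, None, self.root
        while isinstance(node, Node) and node.depth < d:
            parent, edge = node, key[node.depth]
            node = node.edges[edge]
        if isinstance(node, Node) and node.depth == d:
            node.edges[key[d]] = key          # existing branch point, new edge
            return
        branch = Node(d)                      # new branch point at position d
        branch.edges[key[d]] = key
        branch.edges[other[d]] = node
        if parent is None:
            self.root = branch
        else:
            parent.edges[edge] = branch

    def candidate(self, key):
        """Key reached by the lossy descent, before verification."""
        if self.root is None:
            return None
        return self._closest_leaf(key + END)[:-1]

    def search(self, key):
        # Must verify: the descent skipped every position not stored in a node.
        return self.candidate(key) == key


trie = PatriciaTrie()
for k in ["at", "california", "car", "cat"]:
    trie.insert(k)

print(trie.search("california"))   # True
print(trie.candidate("citation"))  # cat (positions 0 and 2 match, the rest was never checked)
print(trie.search("citation"))     # False
```

With these four keys the sketch builds a root labeled 0 and a single inner node labeled 2, so the number of nodes depends on how many keys there are, not on how long they are.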
7Index Fabric (4)
- But:
- The tree can get very unbalanced
[Figure: PATRICIA trie over the keys at, car, cat, calexico, california, calimesa, with nodes labeled 0, 2, 3 and 4.]
8Index Fabric (5)
- Solution
- Divide the trie into block-sized sub-tries
- Index sub-tries with a second trie
[Figure: the six-key trie (at, car, cat, calexico, california, calimesa) divided into blocks, with the prefixes ca, cal and cali marking the sub-tries.]
9Index Fabric (6)
[Figure: Index Fabric with three layers (Layer 2, Layer 1, Layer 0) over the keys at, car, cat, calexico, california, calimesa. Blocks in adjacent layers are connected by direct links and far links.]
10Index Fabric (7)
- Searching Index Fabric
- Start from root node of leftmost layer
- If edge is far link, proceed horizontally to block in next layer
- If no labeled edge matches, follow direct link to new block in next layer
- Continue to rightmost layer until desired data is found (or not)
11Index Fabric (8)
[Figure: Search california, traced across Layer 2, Layer 1 and Layer 0 through direct links and far links.]
14Index Fabric (8)
[Figure: Search california, continued across Layer 2, Layer 1 and Layer 0, annotated "Must verify!".]
15Index Fabric (8)
[Figure: Search citation on the same three-layer index, annotated "Not a match!!".]
16Index Fabric (9)
- Searching (continued)
- Ideally, one block per layer accessed
17Index Fabric (10)
[Figure: layered index over the keys at, california, car, cat, with nodes labeled 0 and 2.]
18Index Fabric (11)
[Figure: layered index over the keys at, car, cat, california, calimesa, with nodes labeled 0, 2 and 4.]
19Indexing XML with the Index Fabric
- Encoding Scheme
- Designators
- Designator Dictionary
- Raw Paths
- Refined Paths
20Indexing XML with the Index Fabric (2)
- Designators
- Special characters or strings used to encode data paths
- Unique designator assigned to each tag and attribute appearing in XML document
- e.g. I = <invoice>, B = <buyer>, N = <name>
- <invoice><buyer><name>ABC Corp</name></buyer></invoice> is encoded as I B N ABC Corp
21Indexing XML with the Index Fabric (3)
- Designator Dictionary
- mapping between tag names and their assigned designators
- when XML document is parsed, tag names matched to designators
- tag names from queries translated to designators to form search key
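The query-side use of the dictionary can be illustrated with a short Python sketch. The function name `to_search_key` and the slash-separated path syntax are assumptions made for this example. The three designators are the ones from the designator slide, and the space between key parts follows the way that slide prints the key; the separator is a presentation choice here, not something the slides specify.

```python
# Designator dictionary: tag name -> designator (values taken from the slides).
DESIGNATORS = {"invoice": "I", "buyer": "B", "name": "N"}


def to_search_key(path, value, designators=DESIGNATORS):
    """Translate a simple path such as /invoice/buyer/name plus a value into a key.

    Hypothetical helper: the path syntax and the space separator are assumptions.
    """
    tags = [t for t in path.split("/") if t]
    try:
        prefix = [designators[t] for t in tags]
    except KeyError as exc:
        raise KeyError(f"no designator assigned to tag {exc.args[0]!r}") from None
    return " ".join(prefix + [value])


print(to_search_key("/invoice/buyer/name", "ABC Corp"))  # I B N ABC Corp
```

Because documents and queries go through the same dictionary, a query on a tag path turns into a single string lookup in the index.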
22Indexing XML with the Index Fabric (4)
- Raw Paths
- Index the hierarchical structure of XML by encoding root-to-leaf paths as strings
- assume no a priori knowledge of queries or structure
- Prefix-encoding scheme
23Indexing XML with the Index Fabric (5)
Doc. 1
<invoice>
<buyer> <name>Oracle Inc</name> <phone>555-1212</phone> </buyer>
<seller> <name>IBM</name> </seller>
<item> <count>4</count> <name>nail</name> </item>
</invoice>
Doc. 2
<invoice>
<buyer> <name>ABC Corp</name> <address>1 Industrial Way</address> </buyer>
<seller> <name>Acme Inc</name> <address>2 Acme Rd.</address> </seller>
<item count="3">saw</item>
<item count="2">drill</item>
</invoice>
Designators
<invoice> = I, <buyer> = B, <seller> = S, <name> = N, <address> = A, <phone> = P, <item> = T, <count> = C, count (attribute) = C
Raw path keys, Doc. 1
IBNOracle Inc, IBP555-1212, ISNIBM, ITC4, ITNnail
Raw path keys, Doc. 2
IBNABC Corp, IBA1 Industrial Way, ISNAcme Inc, ISA2 Acme Rd., ITC3, ITsaw, ITC2, ITdrill
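The raw-path encoding of the two example documents can be reproduced with a short Python sketch built on the standard `xml.etree.ElementTree` parser. The function name `raw_paths` is hypothetical. The designators are the ones listed on the slide, and the sketch assumes, as the slide text appears to, that the `count` attribute shares the designator C with the `<count>` element and that designators and values are concatenated without a separator.

```python
import xml.etree.ElementTree as ET

# Designators as listed on the slide (attribute "count" assumed to map to C too).
DESIGNATORS = {"invoice": "I", "buyer": "B", "seller": "S", "name": "N",
               "address": "A", "phone": "P", "item": "T", "count": "C"}


def raw_paths(xml_text, designators=DESIGNATORS):
    """Return one key per root-to-leaf path: designator prefix followed by the value."""
    keys = []

    def walk(elem, prefix):
        prefix = prefix + designators[elem.tag]
        for attr, value in elem.attrib.items():   # attributes are leaves too
            keys.append(prefix + designators[attr] + value)
        text = (elem.text or "").strip()
        if text:                                  # element with a data value
            keys.append(prefix + text)
        for child in elem:
            walk(child, prefix)

    walk(ET.fromstring(xml_text), "")
    return keys


doc2 = """<invoice>
  <buyer><name>ABC Corp</name><address>1 Industrial Way</address></buyer>
  <seller><name>Acme Inc</name><address>2 Acme Rd.</address></seller>
  <item count="3">saw</item>
  <item count="2">drill</item>
</invoice>"""

for key in raw_paths(doc2):
    print(key)
```

Each returned string would then be inserted into the index as an ordinary key, so no knowledge of the queries or of a schema is needed in advance.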
24Indexing XML with the Index Fabric (6)
25Indexing XML with the Index Fabric (7)
- Raw Paths (continued)
- New documents can be added any time
26Indexing XML with the Index Fabric (8)
- Refined Paths
- Specialized paths that optimize frequently occurring queries
- Encoded similarly to raw paths
- Stored in same index as raw paths
27Indexing XML with the Index Fabric (9)
- Refined Paths (continued)
- Example: Find invoices where Acme Inc. sold to ABC Corp.
- Designator, say Z, is assigned to this path, and encoded with information in query
- Index Fabric key is Z Acme Inc. ABC Corp.
28Indexing XML with the Index Fabric (10)
- Improving Query Processing
- Raw Paths
- Simple path expressions
- General path expressions
- Refined Paths
- Query processor recognizes a path expression as a refined path and translates it to a search key
29Experimental Results
- Index Fabric
- Native RDBMS index (B-tree)
- Edge mapping
- STORED system
- Tested using data from DBLP archive
30Experimental Results (2)
- Basic Edge Mapping
- treats XML as set of nodes and edges
- two tables
- roots(id, label)
- edges(parentid, childid, label)
- key-compressed B-tree indexes
- roots(id), roots(label)
- edges(parentid), edges(childid), edges(label)
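The edge-mapping baseline can be tried out with Python's built-in `sqlite3` module. The sketch below is only an approximation of the setup on the slide: SQLite's ordinary B-tree indexes stand in for the key-compressed B-trees of the RDBMS, attributes are ignored, and storing a data value as an edge whose `childid` is NULL and whose label is the text is an assumption of this example. The helper names `load` and `invoices_by_buyer` are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET
from itertools import count

SCHEMA = """
CREATE TABLE roots(id INTEGER, label TEXT);
CREATE TABLE edges(parentid INTEGER, childid INTEGER, label TEXT);
CREATE INDEX roots_id ON roots(id);
CREATE INDEX roots_label ON roots(label);
CREATE INDEX edges_parentid ON edges(parentid);
CREATE INDEX edges_childid ON edges(childid);
CREATE INDEX edges_label ON edges(label);
"""

# One self-join per step of the path invoice/buyer/name, plus one for the value.
QUERY = """
SELECT r.id FROM roots r
JOIN edges b ON b.parentid = r.id AND b.label = 'buyer'
JOIN edges n ON n.parentid = b.childid AND n.label = 'name'
JOIN edges v ON v.parentid = n.childid AND v.childid IS NULL AND v.label = ?
WHERE r.label = 'invoice'
"""


def load(conn, xml_text, ids):
    """Shred one document into roots/edges rows; returns the document's root id."""
    root = ET.fromstring(xml_text)
    root_id = next(ids)
    conn.execute("INSERT INTO roots VALUES (?, ?)", (root_id, root.tag))
    stack = [(root_id, root)]
    while stack:
        node_id, elem = stack.pop()
        text = (elem.text or "").strip()
        if text:  # assumed convention: a value is an edge with no child node
            conn.execute("INSERT INTO edges VALUES (?, NULL, ?)", (node_id, text))
        for child in elem:
            child_id = next(ids)
            conn.execute("INSERT INTO edges VALUES (?, ?, ?)",
                         (node_id, child_id, child.tag))
            stack.append((child_id, child))
    return root_id


def invoices_by_buyer(conn, buyer_name):
    return [row[0] for row in conn.execute(QUERY, (buyer_name,))]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
ids = count(1)
doc1_id = load(conn, "<invoice><buyer><name>Oracle Inc</name></buyer></invoice>", ids)
doc2_id = load(conn, "<invoice><buyer><name>ABC Corp</name></buyer></invoice>", ids)
print(invoices_by_buyer(conn, "ABC Corp") == [doc2_id])  # True
```

The query needs one join per path step, whereas the Index Fabric encodes the same path and value as a single key.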
31Experimental Results (3)
- STORED system
- partial schema extracted from XML, using data mining
- nonconforming data stored in overflow buckets, in a manner similar to edge mapping
- key-compressed B-tree indexes
32Experimental Results (4)
33Experimental Results (5)
34Experimental Results (6)
35Experimental Results (7)
- Query B: find conference paper by author
36Experimental Results (8)
- Query D: find publications by co-authors
37Conclusion
- Fast index for efficiently accessing XML and other semistructured data
- Outperforms existing mechanisms for handling semistructured data, sometimes by an order of magnitude