Part One XML and Databases - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Part One XML and Databases


1
Part One: XML and Databases
  • Soumen Chakrabarti
  • CSE, IIT Bombay

2
Form and content
  • The Web today
  • HTML generated by hand, WYSIWYG editors,
    webified databases
  • HTML specifies rendering for human reading
  • Screen scraping required to consolidate data
  • The Web in the future
  • Common interchange format (XML)
  • Concentrate on content, not form
  • Represent data class broader than relations

3
Role of databases
  • Contribute
  • Data storage and indexing
  • Query processing and optimization
  • Views, transformations, integration
  • Adopt
  • Search modalities
  • Content-based approximate search
  • Linguistic analysis

4
Features of semi-structured data
  • No explicit schema, or volatile schema
  • Schema size comparable to data size
  • Structure changes without notice
  • Heterogeneous, deeply nested, irregular
  • Has nature of documents rather than tables

5
Semi-structured data model example
[Figure: Object Exchange Model (OEM) graph. The root Bib (o1, a complex object) has two paper edges and one book edge to objects o12, o24, and o29, which are linked to one another by references edges. Their outgoing edges carry labels such as author, title, year, page, publisher, and http; nested edges include firstname, lastname, first, and last. Atomic objects hold values such as Serge, Abiteboul, Victor, Vianu, 1997, 122, and 133.]
6
Syntax
{paper: {author: "Abiteboul",
         author: {firstname: "Victor", lastname: "Vianu"},
         title: "Regular path queries",
         page: {first: 122, last: 133}}}
7
Some observations
  • Missing or additional attributes
  • Multiple attributes
  • Different types in different objects
  • Heterogeneous collections

8
Object IDs and references
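[Figure and XML fragment: three person elements (Jane, Mary, John) and three object IDs (o123, o456, o555). Mary's children element refers to o123 and o555 through idref, and a mother reference points to o456; the graph shows the corresponding children and mother edges.]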
9
Names and acronyms
  • OEM (Object Exchange Model): a semi-structured
    data model from Stanford, 1995
  • Lore: a system for storing data adhering to the
    OEM
  • Lorel: a query language for Lore
  • XML (eXtensible Markup Language): a
    simplification of SGML and a generalization of
    HTML
  • XML-QL: a query language for XML

10
Lorel query examples
select Bib.paper.title from Bib.paper where Bib.paper.year 1995
Alternative
select X.title from Bib.paper X, Bib.(paper|book) Y where Y.author.lastname? = "Ullman" and Y.reference = X
Navigating partially known structures
Transitive closure
11
XML-QL query examples
[Two XML-QL queries over www.a.b.c/bib.xml. The where clause of the first matches the publisher name Morgan Kaufmann; each query ends with a construct clause.]
12
XML storage in ternary relation
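[Figure: object o1 has a paper edge to o2, and o2 has year, title, author, and author edges to objects o3 to o6; the atomic values shown are "The Calculus" and 1986.]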
  • Too many joins
  • Label name storage redundant
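
The "too many joins" point can be made concrete with a minimal sketch of the ternary (source, label, destination) layout. The row contents below are assumptions for illustration: the figure does not say which of o3 to o6 holds which value, and the helper names `follow` and `eval_path` are hypothetical.

```python
# Hypothetical edge table: one (source, label, destination) row per edge.
edges = [
    ("o1", "paper", "o2"),
    ("o2", "year", "o3"),
    ("o2", "title", "o4"),
    ("o2", "author", "o5"),
    ("o2", "author", "o6"),
]
# Atomic values; the object-to-value assignment is assumed.
values = {"o3": 1986, "o4": "The Calculus"}

def follow(sources, label):
    """One self-join step: destinations reached from `sources` via `label`."""
    return {dst for (src, lab, dst) in edges if src in sources and lab == label}

def eval_path(root, path):
    """Evaluate a label path; every label costs one join on the edge table."""
    frontier = {root}
    for label in path:
        frontier = follow(frontier, label)
    return frontier

titles = [values[o] for o in eval_path("o1", ["paper", "title"])]
print(titles)  # ['The Calculus']
```

A path of length k needs k joins here, and every row repeats its label string, which is the redundancy the second bullet refers to.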

13
Storage optimization through mining
  • Inline common cases
  • Tolerate a few nulls

14
Schema extraction
  • Schema: a template for type/semantics
    specification
  • Conformance
  • Does the data conform to a given schema?
  • Classification
  • If so, which objects belong to what
    classes/types?
  • Applications
  • Storage and query optimization

15
Graph simulation
  • Given two edge-labeled graphs G1 and G2, a
    simulation is a relation R between nodes such
    that if (x1, x2) is in R, and (x1, a, y1) is in
    G1, then there exists (x2, a, y2) in G2 (same
    label) such that (y1,y2) is in R

[Figure: graphs G1 and G2 related by R; an edge labeled a leads to y1.]
16
Upper and lower bound schema
  • Lower bound schema
  • Conformance: find simulation R from S to D
  • Classification: check if (c, x) is in R
  • Used in storage optimization
  • Upper bound schema (data guides)
  • Conformance: find simulation R from D to S
  • Classification: check if (x, c) is in R
  • Used in path index generation and query
    optimization
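
Both conformance checks reduce to computing a simulation between a schema graph S and a data graph D. The following is a minimal sketch, assuming graphs are given as sets of (source, label, target) triples; it computes the largest simulation by starting from all node pairs and discarding pairs that violate the definition. The function name and the toy graphs are hypothetical.

```python
def maximal_simulation(g1, g2):
    """Largest R such that (x1, x2) in R and (x1, a, y1) in g1 imply
    some (x2, a, y2) in g2 with (y1, y2) in R."""
    def nodes(g):
        return {s for s, _, _ in g} | {t for _, _, t in g}

    r = {(x1, x2) for x1 in nodes(g1) for x2 in nodes(g2)}
    changed = True
    while changed:  # repeat until no pair can be discarded
        changed = False
        for (x1, x2) in list(r):
            ok = all(
                any(s2 == x2 and a2 == a and (y1, y2) in r
                    for (s2, a2, y2) in g2)
                for (s1, a, y1) in g1 if s1 == x1
            )
            if not ok:
                r.discard((x1, x2))
                changed = True
    return r

# Hypothetical schema S and data D.
schema = {("Root", "employee", "Emp"), ("Emp", "worksfor", "Comp")}
data = {
    ("r", "employee", "p1"), ("r", "employee", "p2"),
    ("p1", "worksfor", "c"), ("p2", "worksfor", "c"),
    ("p1", "manages", "p2"),
}

# Lower bound: simulate S in D, then check the root pair.
print(("Root", "r") in maximal_simulation(schema, data))  # True
# Upper bound: simulate D in S; the extra manages edge breaks it.
print(("r", "Root") in maximal_simulation(data, schema))  # False
```

In this toy case the data offers everything the schema requires, so it conforms under the lower bound reading, but the manages edge has no counterpart in the schema, so it does not conform under the upper bound reading.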

17
Sample data
[Figure: sample data graph. The root r has employee edges to eight person objects p1 to p8 and a company edge to c; persons are linked by manages and managedby edges, and eight worksfor edges lead to the company object c.]
18
Lower bound schema
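[Figure: lower bound schema. Root (r) has employee edges to Bosses (p1, p4, p6) and Regulars (p2, p3, p5, p7, p8) and a company edge to Company (c); Bosses and Regulars are linked by manages and managedby edges, and both have worksfor edges to Company.]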
19
Storage using lower bound schema
Lower-bound schema
Store rest in overflow graph
20
Upper bound schema (DataGuides)
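[Figure: upper bound schema. Root (r) has an employee edge to Employees (p1, p2, p3, p4, p5, p6, p7, p8) and a company edge to Company (c); Bosses (p1, p4, p6) and Regulars (p2, p3, p5, p7, p8) are reached by managedby and manages edges, and worksfor edges lead to Company (c).]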
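
The slides do not say how the upper bound schema is built. One common construction, sketched here as an assumption, is a subset construction over label paths in the style of strong DataGuides: each schema node is the set of data objects reachable from the root by the same label path. The data below is a hypothetical three-employee version of the sample graph.

```python
from collections import defaultdict, deque

def build_dataguide(edges, root):
    """Return {guide node: {label: guide node}}, where each guide node is
    a frozenset of data objects reached by the same label path."""
    out = defaultdict(lambda: defaultdict(set))
    for src, label, dst in edges:
        out[src][label].add(dst)

    guide = {}
    queue = deque([frozenset([root])])
    while queue:
        node = queue.popleft()
        if node in guide:
            continue  # already expanded
        targets = defaultdict(set)
        for obj in node:
            for label, dsts in out[obj].items():
                targets[label] |= dsts
        guide[node] = {lab: frozenset(ds) for lab, ds in targets.items()}
        queue.extend(guide[node].values())
    return guide

# Hypothetical data: p1 manages p2 and p3; everyone works for c.
edges = [
    ("r", "employee", "p1"), ("r", "employee", "p2"), ("r", "employee", "p3"),
    ("r", "company", "c"),
    ("p1", "manages", "p2"), ("p1", "manages", "p3"),
    ("p2", "managedby", "p1"), ("p3", "managedby", "p1"),
    ("p1", "worksfor", "c"), ("p2", "worksfor", "c"), ("p3", "worksfor", "c"),
]
guide = build_dataguide(edges, "r")
for node, succ in guide.items():
    print(sorted(node), {lab: sorted(t) for lab, t in succ.items()})
```

On this input the guide has five nodes that play the roles of Root, Employees, Bosses, Regulars, and Company in the figure.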
21
Query optimization issues
Select x from A.B x where exists y in x.C y5
[Figure: data tree with nodes labeled A, B, C, and D and leaf values 4 and 5.]
22
What makes the problem difficult
  • Selectivity estimation
  • Index selection
  • Access cost models
  • Clustering choices

23
Part Two: Information Retrieval and Databases
  • Soumen Chakrabarti
  • CSE, IIT Bombay

24
Information retrieval (IR)
  • Search
  • Inverted index
  • Boolean match
  • Relevance ranking
  • Classification
  • Learn topics from examples
  • Clustering
  • Discover topics from a document collection
  • Never done inside a relational database

[Figure: inverted index with positional postings. cat: D5 (3, 37, 50), D7 (9, 20); dog: D7 (7, 90, 400), D20 (22, 533).]
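
A minimal sketch of the inverted index and Boolean match listed above, assuming whitespace tokenization and word offsets as positions. The three documents are made up for illustration and do not reproduce the postings in the figure.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to {doc_id: [word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def boolean_and(index, *terms):
    """Documents that contain every one of the given terms."""
    postings = [set(index.get(t, {})) for t in terms]
    return set.intersection(*postings) if postings else set()

# Hypothetical documents.
docs = {"D5": "the cat sat", "D7": "cat chases dog", "D20": "dog sleeps"}
index = build_index(docs)
print(boolean_and(index, "cat", "dog"))  # {'D7'}
```

Keeping positions in the postings is what later allows proximity operators such as NEAR; a plain Boolean match only needs the document IDs.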
25
Current style of loose integration
  • RDBMS provides hooks
  • Declare some columns as textual with keyword
    index
  • Inserts, updates, and deletes trigger external
    program, e.g., Verity search engine
  • Search engine maintains separate indices
  • Simple query rewriting to combine relational and
    text-match where-clauses

26
Reasons
  • Space
  • BLOB vs. pure relational representation
  • Average English word is only 5 bytes
  • Time
  • Most text engines are resigned to flexible (i.e.,
    no) model for data consistency
  • Much faster read-only access than relational
    database lookups

27
New features desired
  • Operations that are more complex than keyword
    search can benefit from tighter coupling with
    RDBMS
  • Approximate search is essential (Anand Rajaraman,
    Amazon.com, SIGMOD 99)
  • Misspelled book titles and author names are common
  • Variants of OEM edge labels (author/writer/poet)
  • Similarity extends to structure as well
    ('Travolta' NEAR 'Cage' = 'Face/Off')

28
Case study: generalized like
  • SQL has limited string matching constructs
  • like x, x, x
  • x must be exact match
  • Need more lenient match
  • Applications: LDAP, IR
  • String edit distance is not suitable
  • Given query, order strings in database in
    increasing order of edit distance and pick top 5
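
The ranking described in the last bullet can be sketched as follows. The slide does not spell out why edit distance is unsuitable; one plausible reason, visible in this sketch, is that a naive ranking has to compare the query against every stored string. The function names are hypothetical.

```python
import heapq

def edit_distance(a, b):
    """Levenshtein distance by dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

def top_k(query, strings, k=5):
    """The k stored strings closest to the query; scans every string."""
    return heapq.nsmallest(k, strings, key=lambda s: edit_distance(query, s))

print(top_k("pascal", ["rascal", "nascent", "pascal"], k=2))
# ['pascal', 'rascal']
```

Each comparison costs time proportional to the product of the two string lengths, and no index is used to skip candidates.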

29
Sliding-window matching
[Figure: 3-grams of three strings. nascent: nas, asc, sce, cen, ent; pascal: pas, asc, sca, cal; rascal: ras, asc, sca, cal.]
  • Given a query, scan to get a set of 3-grams
  • Similarity of a database string to the query =
    number of shared 3-grams
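
The two steps above can be sketched directly, using the three strings from the slide. This is an illustration under simple assumptions (sets of 3-grams, no padding at string ends); the function names are hypothetical.

```python
def trigrams(s):
    """Set of 3-grams from a sliding window over s."""
    return {s[i:i + 3] for i in range(len(s) - 2)}

def similarity(query, candidate):
    """Number of 3-grams the two strings share."""
    return len(trigrams(query) & trigrams(candidate))

def rank(query, strings):
    """Stored strings ordered by decreasing similarity to the query."""
    return sorted(strings, key=lambda s: similarity(query, s), reverse=True)

print(sorted(trigrams("pascal")))       # ['asc', 'cal', 'pas', 'sca']
print(similarity("pascal", "rascal"))   # 3
print(similarity("pascal", "nascent"))  # 1
print(rank("pascal", ["nascent", "rascal"]))  # ['rascal', 'nascent']
```

Because each 3-gram is an exact token, the stored strings can be indexed by 3-gram in an ordinary inverted index, so a query touches only strings that share at least one 3-gram with it.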

30
Issues
  • Minimally disruptive architecture
  • Low storage overheads
  • Fast query processing
  • Good selectivity estimates
  • Combining with other predicates for ranking
  • Efficiently handling updates