Extracting Schema From Data - PowerPoint PPT Presentation

1 / 26

About This Presentation

Title:

Extracting Schema From Data

Description:

The difference between schemas for semistructured data and ... Any other deterministic schema graph to which our data conforms subsumes the data guide. ... – PowerPoint PPT presentation

Number of Views:36

Avg rating:3.0/5.0

Slides: 27

Provided by: joseemend

Category:

more less

Transcript and Presenter's Notes

Title: Extracting Schema From Data

1
Extracting Schema From Data

The difference between schemas for semistructured
data and traditional schemas is that a given
semistructured data can have more than one schema
.
Given a semistructured data, compute
automatically some schema for it, given several
possible answers, we want the schema that best
describes the structure of that particular
data.This is called Schema Extraction.

Schema Extraction for schema graphs
Schema Extraction for Datalog Typings

Data Guides
Our goal is to construct a new OEM graph that is
a finite description of the list of paths. This
is called Data Guide. The two properties to be
fulfilled
Accurate Every path in the data occurs in the
data guide, and every path in the data guide
occurs in the data.
Concise Every path occurs exactly once.

4
(No Transcript)
5

We proceed as follows The Data Guide will have a
root node, call it Root. Next we examine one by
one each path in the list and add new nodes to
the data guide, as needed
employee
employee.name
employee.manages
employee.manages.managedby
employee.manages.managedby.manages
employee.manages.managedby.manages.managedby
company

6
Root r
Employees p1,p2,p3,p4 p5,p6,p7,p8
employee
position
name
company
phone
managedby
manages
managedby
Regular p2,p3,p5 p7,p8
Boss p1,p4,p6
worksfor
name
worksfor
manages
phone
position
name
Company c
worksfor
name
A Data Guide
7
Root
employee
manages
company
Emp
worksfor
managedby
name
Comp
position
phone
name
Schema graph
8
Simulation between a data graph and a data guide
Node in data graph
Node in data guide r
Root p1,
p2, p3, p4, p5,
Employee p6, p7, p8 p1, p4, p6
Boss p2, p3, p5,
p7, p8 Regular c

Company
Simulation from the data guide to the schema
graph
Node in data guide
Node in schema graph Root
Root Employee
Emp Boss
Emp Regular

Emp Company
Comp
9

This construction of the data guide
resembles the technique to transform a
nondeterministic finite state automaton into a
deterministic one .
The data guide is the most specific schema
graph for that data with the following features
The data guide is a deterministic schema graph.
Any other deterministic schema graph to which our
data conforms subsumes the data guide.

10
A nondeterministic schema
Rootr
employee
employee
company
managedby
Regular p2,p3,p5 p7,p8
Boss p1, p4,p6

worksfor
manages
name
name
phone
Comp c
worksfor
11
Extracting Datalog rules from data We have a
semistructured data instance and want to extract
automatically the most specific typing given by a
set of Datalog rules. We create one predicate
for each complex value object in the data. We
create the following predicates pred_r, pred_c,
pred_p1, pred_p2, pred_p3, pred_p4, pred_p5,
pred_p6, pred_p7, pred_p8 corresponding to the
objects r, c, p1, p2, p3, p4, p5, p6,
p7, p8 .
12
Next we write a set of Datalog rules defining
each predicate based exactly on the outgoing
edges of its corresponding object pred_r(X) -
ref(X, company, Y), pred_c(Y),
ref(X, employee, Z1), pred_p1(Z1),
ref(X, employee,
Z8), pred_p8(Z8) pred_c(X) - ref(X, name, N),
string(N) pred_p1(X) - ref(X, worksfor, Y),
pred_c(Y), ref(X, name,
N) , string(N), ref(X, phone, P),
string(P), ref(X, manages, Z),
pred_p2(Z), ref(X,
manages, U) , pred_p3(U)
13
pred_p2(X) - ref(X , worksfor, Y) , pred_c(Y),
ref(X, name, N) ,
string(N), ref(X,
manageby, Z), pred_p1(Z) pred_p3(X) - ..
We have to compute the largest fixpoint of the
Datalog program on the given data.
14
Object
Predicate
pred_r pred_c pred_p1 pred_p2 pred_p3 pred_p4 pre
d_p5 pred_p6 pred_p7 pred_p8
r c, p1, p2, p3, p4,p5, p6, p7, p8
p1 p2, p3, p5, p7, p8 p3, p5, p7,
p8 p1, p4, p6 p3, p5, p7, p8 p1,
p4, p6 p3, p5, p7, p8 p3, p5, p7,
p8
Extents of predicates after one iteration
15
Object
Predicate
pred_r pred_c pred_p1 pred_p2 pred_p3 pred_p4 pre
d_p5 pred_p6 pred_p7 pred_p8
r c, p1, p2, p3, p4,p5, p6, p7, p8
p1 p2, p3 p3 p1, p4, p6 p3, p5,
p7, p8 p1, p4, p6 p3, p5, p7, p8
p3, p5, p7, p8
Extents of predicates after two iterations
16
We obtain the following Datalog rules Root(X)
- ref(X, company, Y), Company(Y), ref(X,
employee, Z1),
Boss1(Z1), ref(X, employee, Z2), Boss2(Z2),
ref(X, employee, U1),
Regular1(U1),..,
ref(X, employee, U3), Regular3(U3) Company
- ref(X, name, N), string(N) Boss1(X) -
ref(X, worksfor, Y), Company(Y), ref(X, name,
N), string(N), ref(X,
phone, P), string(P), ref(X, manages,Z),
Regular1(Z), ref(X, manages, U),
Regular2(U) Boss2(X) - ref(X, worksfor,
Y), Company(Y), ref(X, name, N),
string(N), ref(X, phone, P), string(P),
ref(X, manages,Z),
Regular3(Z) Regular1(X) - ref(X, worksfor, Y),
Company(Y), ref(X, name, N),
string(N), ref(X, managedby, Z),
Boss1(Z) Regular2(X) - ref(X, worksfor, Y),
Company(Y), ref(X, name, N),
string(N), ref(X, position, P), string(P),
ref(X, managedby, Z),
Boss1(Z) Regular3(X) - ref(X, worksfor, Y),
Company(Y), ref(X, name, N),
string(N), ref(X, position, P), string(P),
ref(X, managedby, Z),
Boss2(Z)

17
Inferring Schemas From Queries Some
semistructured data instances are the result of
queries.
Query
Result Schema
Inferring
The following query takes a bibliography file and
constructs a homepage for every author
where bib -gt L -gt X, X - gt author -gt A, X
-gt title -gt T, X -gt year -gt Ycreate Root( ),
HomePage(A), YearEntry(A,Y), PageEntry(X)link
Root() -gt person -gt HomePage(A),
Homepage(A) -gt year -gt YearEntry(A,Y)
YearEntry(A,Y) -gt paper -gt PaperEntry(X)
PaperEntry(X) -gt title -gt T,
PaperEntry(X) -gt author -gt HomePage(A),
PaperEntry(X) -gt year -gt Y
18
Root
person
person
Homepage (smith)
Homepage (Jones)
author
author
author
year
year
year
YearEntry (smith,1997)
YearEntry (smith,1995)
YearEntry (smith,1997)
author
paper
paper
paper
paper
PaperEntry (o423)
PaperEntry (o552)
PaperEntry (o153)
year
title
year
year
title
title
19

Schema graph inferred from the query
The schema will have one class for each function
, and one edge for each line in the link clause.
Root
person
HomePage
year
YearEntry
author
paper
PaperEntry
title
year
20
where create Root( ), F(X), F(Y), G(X),
H(Y) link Root( ) -gt A -gt F(X), F(X) -gt C
-gt G(X), Root( ) -gt B -gt F(Y), F(X) -gt
D -gt H(Y)
For the following example
We reach the following schema
Root A F, B F F C G, D H
21
Path Constraints

In Relational Databases
in RDB, the relational declaration tell us more
than the types
imposes a key constraint so that no two tuples
have the same key

Example
Create table Employees ( Emp Id integer,
EmpName char(30), DeptId integer,
primary key(EmpId), foreign key(DeptId)
references Departments ) Create table
Departments( DeptID integer, Dname char(10),

primary
key(DeptId)
)
22
In Object-Oriented Databases
Interface Publication extent publication
attribute String title attribute Date date
relationship setltAuthorgt auth
---gtinclusion constraints inverse Authorpub
---gtinverse relationship Interface
Author extent author attribute String
title attribute String address
relationship setltPublicationgt pub
---gtinclusion constraints inverse
Publicationauth ---gtinverse
relationship
23

Inclusion constrainsts
For any publication p, the set p.auth is a subset
of the set author. Similarly, for any author a,
the set a.pub is a subset of publication.

Inverse relationships

For any publication p, and for any author a in
p.auth, p is a member of a.pub .

For any author a, and for any publication p in
a.pub, a is a member of p.auth .

24

Illustration of path constraints on
semistructured data
r
publication
publication
author
author
auth
auth
auth
pub
pub
pub
title
title
date
date
name
name
address
address
...
...
...
...
...
...
...
...
25
In semistructured data

inclusion constraint is expressed as follows
?p (?a (author(r,a) ? pub(a,p)) -gt
publication(r,p))
The general form of an inclusion constraint is
?x ((r,x)) -gt (r,x))

inverse relationship is
?p ( publication(r,p) -gt ? a(auth(p,a) -gt
pub(a,p)))

The general form of this constraint is
?x ((r,x)) -gt ?y((x,y)-gt (y,x)))
26
Constraints are also important in Query
Optimization. Here is an example Select row
P2from r.publication P1, r.publication
P2, P1.auth Awhere Database Systems
in P1.title and A in P2.auth
Select row Pfrom r.publication P,
P.auth A, A.pub Pwhere Database
Systems in P.title
The query plan implicit in the first one
requires two iterations over publication - with
P1,P2 - whereas the second requests only one
iteration - with P .

Write a Comment

User Comments (0)