Title: Extracting Schema From Data
1Extracting Schema From Data
- The difference between schemas for semistructured
data and traditional schemas is that a given
semistructured data can have more than one schema
. - Given a semistructured data, compute
automatically some schema for it, given several
possible answers, we want the schema that best
describes the structure of that particular
data.This is called Schema Extraction.
2- Schema Extraction for schema graphs
- Schema Extraction for Datalog Typings
3- Data Guides
- Our goal is to construct a new OEM graph that is
a finite description of the list of paths. This
is called Data Guide. The two properties to be
fulfilled - Accurate Every path in the data occurs in the
data guide, and every path in the data guide
occurs in the data. - Concise Every path occurs exactly once.
4(No Transcript)
5- We proceed as follows The Data Guide will have a
root node, call it Root. Next we examine one by
one each path in the list and add new nodes to
the data guide, as needed - employee
- employee.name
- employee.manages
- employee.manages.managedby
- employee.manages.managedby.manages
- employee.manages.managedby.manages.managedby
- company
6Root r
Employees p1,p2,p3,p4 p5,p6,p7,p8
employee
position
name
company
phone
managedby
manages
managedby
Regular p2,p3,p5 p7,p8
Boss p1,p4,p6
worksfor
name
worksfor
manages
phone
position
name
Company c
worksfor
name
A Data Guide
7Root
employee
manages
company
Emp
worksfor
managedby
name
Comp
position
phone
name
Schema graph
8Simulation between a data graph and a data guide
Node in data graph
Node in data guide r
Root p1,
p2, p3, p4, p5,
Employee p6, p7, p8 p1, p4, p6
Boss p2, p3, p5,
p7, p8 Regular c
Company
Simulation from the data guide to the schema
graph
Node in data guide
Node in schema graph Root
Root Employee
Emp Boss
Emp Regular
Emp Company
Comp
9- This construction of the data guide
resembles the technique to transform a
nondeterministic finite state automaton into a
deterministic one . -
- The data guide is the most specific schema
graph for that data with the following features - The data guide is a deterministic schema graph.
- Any other deterministic schema graph to which our
data conforms subsumes the data guide.
10A nondeterministic schema
Rootr
employee
employee
company
managedby
Regular p2,p3,p5 p7,p8
Boss p1, p4,p6
worksfor
manages
name
name
phone
Comp c
worksfor
11Extracting Datalog rules from data We have a
semistructured data instance and want to extract
automatically the most specific typing given by a
set of Datalog rules. We create one predicate
for each complex value object in the data. We
create the following predicates pred_r, pred_c,
pred_p1, pred_p2, pred_p3, pred_p4, pred_p5,
pred_p6, pred_p7, pred_p8 corresponding to the
objects r, c, p1, p2, p3, p4, p5, p6,
p7, p8 .
12Next we write a set of Datalog rules defining
each predicate based exactly on the outgoing
edges of its corresponding object pred_r(X) -
ref(X, company, Y), pred_c(Y),
ref(X, employee, Z1), pred_p1(Z1),
ref(X, employee,
Z8), pred_p8(Z8) pred_c(X) - ref(X, name, N),
string(N) pred_p1(X) - ref(X, worksfor, Y),
pred_c(Y), ref(X, name,
N) , string(N), ref(X, phone, P),
string(P), ref(X, manages, Z),
pred_p2(Z), ref(X,
manages, U) , pred_p3(U)
13pred_p2(X) - ref(X , worksfor, Y) , pred_c(Y),
ref(X, name, N) ,
string(N), ref(X,
manageby, Z), pred_p1(Z) pred_p3(X) - ..
We have to compute the largest fixpoint of the
Datalog program on the given data.
14 Object
Predicate
pred_r pred_c pred_p1 pred_p2 pred_p3 pred_p4 pre
d_p5 pred_p6 pred_p7 pred_p8
r c, p1, p2, p3, p4,p5, p6, p7, p8
p1 p2, p3, p5, p7, p8 p3, p5, p7,
p8 p1, p4, p6 p3, p5, p7, p8 p1,
p4, p6 p3, p5, p7, p8 p3, p5, p7,
p8
Extents of predicates after one iteration
15 Object
Predicate
pred_r pred_c pred_p1 pred_p2 pred_p3 pred_p4 pre
d_p5 pred_p6 pred_p7 pred_p8
r c, p1, p2, p3, p4,p5, p6, p7, p8
p1 p2, p3 p3 p1, p4, p6 p3, p5,
p7, p8 p1, p4, p6 p3, p5, p7, p8
p3, p5, p7, p8
Extents of predicates after two iterations
16We obtain the following Datalog rules Root(X)
- ref(X, company, Y), Company(Y), ref(X,
employee, Z1),
Boss1(Z1), ref(X, employee, Z2), Boss2(Z2),
ref(X, employee, U1),
Regular1(U1),..,
ref(X, employee, U3), Regular3(U3) Company
- ref(X, name, N), string(N) Boss1(X) -
ref(X, worksfor, Y), Company(Y), ref(X, name,
N), string(N), ref(X,
phone, P), string(P), ref(X, manages,Z),
Regular1(Z), ref(X, manages, U),
Regular2(U) Boss2(X) - ref(X, worksfor,
Y), Company(Y), ref(X, name, N),
string(N), ref(X, phone, P), string(P),
ref(X, manages,Z),
Regular3(Z) Regular1(X) - ref(X, worksfor, Y),
Company(Y), ref(X, name, N),
string(N), ref(X, managedby, Z),
Boss1(Z) Regular2(X) - ref(X, worksfor, Y),
Company(Y), ref(X, name, N),
string(N), ref(X, position, P), string(P),
ref(X, managedby, Z),
Boss1(Z) Regular3(X) - ref(X, worksfor, Y),
Company(Y), ref(X, name, N),
string(N), ref(X, position, P), string(P),
ref(X, managedby, Z),
Boss2(Z)
17Inferring Schemas From Queries Some
semistructured data instances are the result of
queries.
Query
Result Schema
Inferring
The following query takes a bibliography file and
constructs a homepage for every author
where bib -gt L -gt X, X - gt author -gt A, X
-gt title -gt T, X -gt year -gt Ycreate Root( ),
HomePage(A), YearEntry(A,Y), PageEntry(X)link
Root() -gt person -gt HomePage(A),
Homepage(A) -gt year -gt YearEntry(A,Y)
YearEntry(A,Y) -gt paper -gt PaperEntry(X)
PaperEntry(X) -gt title -gt T,
PaperEntry(X) -gt author -gt HomePage(A),
PaperEntry(X) -gt year -gt Y
18Root
person
person
Homepage (smith)
Homepage (Jones)
author
author
author
year
year
year
YearEntry (smith,1997)
YearEntry (smith,1995)
YearEntry (smith,1997)
author
paper
paper
paper
paper
PaperEntry (o423)
PaperEntry (o552)
PaperEntry (o153)
year
title
year
year
title
title
19 Schema graph inferred from the query
The schema will have one class for each function
, and one edge for each line in the link clause.
Root
person
HomePage
year
YearEntry
author
paper
PaperEntry
title
year
20 where create Root( ), F(X), F(Y), G(X),
H(Y) link Root( ) -gt A -gt F(X), F(X) -gt C
-gt G(X), Root( ) -gt B -gt F(Y), F(X) -gt
D -gt H(Y)
For the following example
We reach the following schema
Root A F, B F F C G, D H
21 Path Constraints
- In Relational Databases
- in RDB, the relational declaration tell us more
than the types - imposes a key constraint so that no two tuples
have the same key
Example
Create table Employees ( Emp Id integer,
EmpName char(30), DeptId integer,
primary key(EmpId), foreign key(DeptId)
references Departments ) Create table
Departments( DeptID integer, Dname char(10),
primary
key(DeptId)
)
22In Object-Oriented Databases
Interface Publication extent publication
attribute String title attribute Date date
relationship setltAuthorgt auth
---gtinclusion constraints inverse Authorpub
---gtinverse relationship Interface
Author extent author attribute String
title attribute String address
relationship setltPublicationgt pub
---gtinclusion constraints inverse
Publicationauth ---gtinverse
relationship
23- Inclusion constrainsts
- For any publication p, the set p.auth is a subset
of the set author. Similarly, for any author a,
the set a.pub is a subset of publication.
Inverse relationships
- For any publication p, and for any author a in
p.auth, p is a member of a.pub .
- For any author a, and for any publication p in
a.pub, a is a member of p.auth .
24 Illustration of path constraints on
semistructured data
r
publication
publication
author
author
auth
auth
auth
pub
pub
pub
title
title
date
date
name
name
address
address
...
...
...
...
...
...
...
...
25In semistructured data
- inclusion constraint is expressed as follows
- ?p (?a (author(r,a) ? pub(a,p)) -gt
publication(r,p)) - The general form of an inclusion constraint is
- ?x ((r,x)) -gt (r,x))
- inverse relationship is
- ?p ( publication(r,p) -gt ? a(auth(p,a) -gt
pub(a,p)))
The general form of this constraint is
?x ((r,x)) -gt ?y((x,y)-gt (y,x)))
26 Constraints are also important in Query
Optimization. Here is an example Select row
P2from r.publication P1, r.publication
P2, P1.auth Awhere Database Systems
in P1.title and A in P2.auth
Select row Pfrom r.publication P,
P.auth A, A.pub Pwhere Database
Systems in P.title
The query plan implicit in the first one
requires two iterations over publication - with
P1,P2 - whereas the second requests only one
iteration - with P .