DTDDirected Publishing and Integration - PowerPoint PPT Presentation

1 / 64
About This Presentation
Title:

DTDDirected Publishing and Integration

Description:

Coping with both type and integrity constraints. Attribute Integration Grammar (ATG ... incapable of coping with a predefined DTD (e.g. recursion) ... – PowerPoint PPT presentation

Number of Views:38
Avg rating:3.0/5.0
Slides: 65
Provided by: CIS471
Category:

less

Transcript and Presenter's Notes

Title: DTDDirected Publishing and Integration


1
DTD-Directed Publishing and Integration
  • Wenfei Fan
  • Internet Research Dept. Bell Laboratories
  • Dept. of CIS, Temple University

2
Outline
  • XML publishing
  • DTD-directed publishing VLDB02
  • Attribute Transformation Grammars (ATGs)
  • PRATA middleware based on ATGs
  • Schema-directed integration ongoing work
  • Coping with both type and integrity constraints
  • Attribute Integration Grammar (ATG)
  • Research problems
  • Joint work with
  • Michael Benedikt, Chen Yong Chan, Rajeev
    Rastogi, Bell Labs
  • Aoying Zhou, Shihui Zheng, Fudan University

3
Outline
  • XML publishing
  • DTD-directed publishing VLDB02
  • Attribute Transformation Grammars
  • PRATA middleware based on ATGs
  • Schema-directed integration ongoing work
  • Coping with both type and integrity constraints
  • Attribute Integration Grammar (ATG)
  • Research problems

4
The need for XML publishing
  • Most data is stored in existing databases
  • Web data is often managed by DBMS facilities
  • Data needs to be exported to XML -- the prime
    standard for data exchange

XML
XML
DB
DB
5
Simple XML publishing
  • Export relational data in a canonical format
  • Patient
  • ltpatientgt
  • lttuplegt ltSSNgt 123 lt/gt ltnamegt Joe lt/gt
    ltpolicygt LU23 lt/gt
  • lt/tuplegt
  • lttuplegt ltSSNgt 345 lt/gt ltnamegt Mary lt/gt
    ltpolicygt AT45 lt/gt
  • lt/tuplegt
  • lt/patientgt

6
A little harder insurance company and hospital
  • Daily report
  • Relational database R at the hospital
  • Patient (SSN, name, tname, policy)
  • treatment (SSN, tname)
  • ltreportgt
  • ltpatientgt ltSSNgt 123 lt/gt ltnamegt Joe lt/gt
    ltpolicygt LU23 lt/gt
  • lttnamegt MRI lt/gt . . .
    lttnamegt EEG lt/gt
  • lt/patientgt
  • . . .
  • lt/reportgt

hospital
insurance company
XML view
R
XML
grouping treatments under each patient
7
XML publishing
  • An XML view definition language specifying
    desired mapping
  • Efficient implementation of the view language

Q XML view
8
Existing XML publishing tools (1)
  • Embedding single SQL query in XSL stylesheet
  • Result canonical, flat XML
  • Commercial system
  • Oracle 9i XML SQL facilities
  • DB2
  • Incapable of expressing practical XML publishing
    default, flat XML

9
Existing XML publishing tools (2)
  • Extending SQL with XML constructors
  • select XML-aggregation
  • from R1, . . ., Rn
  • where conditions
  • Tools
  • Commercial system IBM DB2 XML extender
  • Middleware (vendor-independent) XPERANTO
  • Verbose and cumbersome
  • small document tedious
  • large documents unthinkable

10
Existing XML publishing tools (3)
  • Annotated template embedding SQL in a fixed XML
    tree
  • Middleware SilkRoute
  • Commercial SQL Server 2000, IBM DB2 DAD
  • Advantages
  • More modular comparing to the universal
    relation approach
  • Limited schema-driven conforming to a fixed doc
    template

11
Getting real data exchange on the Web
  • All members of a community (industry) agree on a
    DTD and then exchange data w.r.t. it
    e-commerce, health-care, ...
  • XML Publishing
  • mapping relational data to XML
  • conforming to the predefined DTD

Web
DTD
XML
XML
Q XML view
DB1
DB2
12
Real-life example insurance company and hospital
  • Daily report
  • Relational database R at the hospital
  • Patient (SSN, name, tname, policy, date)
  • inTreatment (tname, cost)
  • outTreatment (tname, referral)
  • Procedure (tname1, tname2)
  • treatment
  • in hospital composition hierarchy in Procedure
  • outside of the hospital referral

hospital
insurance company
XML view
R
XML
13
Example insurance company and hospital
  • DTD D predefined by the insurance company
  • report ? patient
  • patient ? SSN, pname,
    treatment, policy
  • treatment ? tname, (inTreatment
    outTreatment)
  • inTreatment ? treatment
  • outTreatment ? referral
  • How to define a mapping ? such that for any
    instance DB of R,
  • ? (DB) is an XML document containing all the
    patients and their treatments (hierarchy,
    referral) from DB,
  • ? (DB) conforms to D?

14
Challenge recursive type
  • XML data unbounded depth -- cannot be decided
    statically
  • treatment ? tname, (inTreatment
    outTreatment)
  • inTreatment ? treatment ---
    recursive

SSN
123
15
Challenge non-determinism
  • The choice of a production (element type
    definition)
  • treatment ? tname, (inTreatment outTreatment)
  • -- depends on the underlining relational data

report
16
Existing systems
  • fixed XML tree template or ignoring
    DTD-conformance
  • middleware SilkRoute (ATT), XPERANTO (IBM),
  • systems SQL Server 2000, IBM DB2 XML extender,
  • incapable of coping with a predefined DTD (e.g.
    recursion)
  • type checking define a view and then check its
    conformance
  • undecidable in general, co-NEXPTIME for extremely
    restricted view definitions
  • no guidance on how to define XML views that
    typecheck
  • one gets an XML view that typechecks only after
    repeated failures and with luck

17
Outline
  • XML publishing
  • DTD-directed publishing VLDB02
  • Attribute Transformation Grammars
  • PRATA middleware based on ATGs
  • Schema-directed integration ongoing work
  • Coping with both type and integrity constraints
  • Attribute Integration Grammar (ATG)
  • Research problems

18
Attribute Translation Grammar (ATG)
  • DTD normalized element type definitions e ? ?
  • ? PCDATA ? e1, , en e1
    en e
  • Attributes e associated with each element type
    e
  • e tuple-valued, to pass data value as well as
    control
  • Rules associated with each e ? ? for e in ?,
    e Q(e)
  • SQL query Q extracts data from DB
  • parent attribute e as a constant parameter in Q

19
Semantics conceptual evaluation
  • Top-down
  • report ? patient
  • patient ? select SSN, name, tname,
    policy
  • from Patient --- SQL
    query
  • recall Patient (SSN, name, tname, policy)
  • Data-driven a patient element for each tuple in
    Patient relation

patient
patient
patient
patient
20
Inherited attributes
  • Inherited child is computed using parent
  • patient ? SSN, name, treatment, policy
  • SSN patient.SSN, name
    patient.name
  • treatment patient.tname policy
    patient.policy
  • recall patient (SSN, name, tname, policy)
  • SSN ? PCDATA
  • PCDATA SSN

report
...
patient
patient
patient
patient
patient
treatment
name
policy
SSN
SSN
123
Joe
LU23
PCDATA
21
Coping with non-determinism
  • treatment ? tname, (inTreatment
    outTreatment)
  • tname treatment
  • (inTreatment, outTreatment)
  • case Qc(treatment).tag
    --- conditional query
  • 1 (treatment, null)
  • else (null, treatment)
  • Qc select 1 as tag from inTreatment where
    tname treatment
  • conditional query the choice of production
  • parent as constant parameter in SQL query

...
treatment
treatment
tname
tname
inTreatment
outTreatment
22
Coping with recursion
  • inTreatment ? treatment
  • treatment ? select tname2
  • from Procedure
  • where inTreatment tname1
  • recall Procedure (tname1, tname2)
  • parent as constant parameter in SQL query Q
  • inTreatment is further expanded as long as Q(DB)
    is nonempty

treatment
treatment
23
DTD-directed publishing with ATGs
  • DTD-directed the XML tree is constructed
    strictly following the productions of a DTD ---
    DTD conformance
  • Data-driven the choice of productions and
    expansion of the XML tree (recursion) depends on
    relational data

report
...
patient
patient
patient
patient
treatment
name
SSN
inTreatment
...
tname
Joe
123
treatment
treatment
...
...
24
ATGs vs. existing systems
  • DTD-conformance
  • ATGs provide guidance for how to define
    DTD-directed publishing
  • Other systems based on a fixed tree template
  • Expressive power strictly more expressive than
    others
  • ATGs capable of expressing XML views supported
    by other systems
  • Other systems cannot handle recursion/nondetermin
    ism
  • ATG the first systematic method to ensure
    DTD-conformance

25
ATGs vs. Attribute Grammars (AGs)
  • AGs
  • Definition w.r.t. a CFG
  • Evaluation parse a string with the CFG, then
    evaluate attributes given the parse tree
  • ATGs combining DTD and database operations
  • Definition w.r.t. an ECFG (DTD) and SQL queries
  • Evaluation given DB, extract relevant data from
    DB with SQL queries to build an XML tree of the
    DTD
  • It does not make sense to parse a database
    w.r.t. a DTD
  • ATGs are not a mild variation of AGs

26
Outline
  • XML publishing
  • DTD-directed publishing VLDB02
  • Attribute Transformation Grammars
  • PRATA middleware based on ATGs
  • Schema-directed integration ongoing work
  • Coping with both type and integrity constraints
  • Attribute Integration Grammar (ATG)
  • Research problems

27
PRATA middleware based on ATGs
query plan generation, evaluation
relations
ATG graph
XML
ATG
tagging
parsing
cost estimate query
statistics query results
  • R
  • ATG graph representing the ATG
  • relations representing root-leaf paths ---gt
  • tagging one pass

28
Evaluation of ATGs
  • Conceptual evaluation defining semantics, but
    not efficient
  • Techniques proved useful
  • partitioning reduce db visits, make use of DBMS
    optimizer (handling annotated templates
    complete tree)
  • Relations representing XML trees root-leaf paths
  • Separate tagging

29
ATG graph
  • edge labels SQL queries
  • cyclic introduced by recursion -- not in
    other systems
  • iterative unfolding partial ATG tree to a
    depth d

report
Q1

patient
Q5
Q2
policy
Q3
Q4
Q8
name
SSN
treatment
Q7
Qc
tname
outTreatment
inTreatment
Q9
Q6
referral
procedure

30
Partitioning partial ATG trees
  • query composition (outer join)
  • challenge how to partition into clusters?
  • finer ? fewer nulls, smaller queries sent to
    DBMS
  • coarser ? leverage DBMS optimizer, less DB visits
  • Finding an optimal strategy NP-hard

report
composed Q
Q1

patient
Q2
Q5
Q3
SSN
policy
Q4
name
composed Q
treatment
Qc
Q7
outTreatment
tname
inTreatment
Q9
Q6
procedure
referral
...

Q8
treatment
31
Materialization of intermediate results
  • Partitioning root-leaf relation R ?
    intermediate R1, R2, R3, R4
  • repeated computation each Ri carrying a key from
    the root
  • materialization storing keys in temporary
    tables
  • cons overhead with temporary tables
  • pros reducing recomputation in descendant
    queries
  • What keys to materialize? Not supported by other
    systems

output relation R
R1
sort merge
key
R2
R3
sort merge
key
R4
32
Query plan generation
  • New challenge find best partitioning and
    materialization
  • mutually dependent partitioning and
    materialization must be taken together
  • expensive exponentially many choices NP-hard
  • Efficient cost-based heuristic
  • DBMS statistics arity/cardinality of a table,
    cost to evaluate a query/create temporary table
  • cost estimate simple formulae for computing the
    benefit of materialization/partitioning --
    effective in practice
  • optimal strategy -- dynamic programming
    decides partitioning w.r.t. the benefit of
    materialization

33
Experimental evaluation
  • benchmark
  • enhancement of TPC-H part/supplier relational
    data
  • scale factor 0.1
  • system
  • 1.4Gz Pentium IV, 256M memory, Windows 2000
  • DBMS Oracle 9i with a JDBC interface
  • Middleware implemented in Java

34
Experimental result
Execution Time (sec)
35
Summary ATG
  • ATGs
  • ensure DTD-conformance automatically
  • support recursion/non-determinism naturally
  • simple no need for a new language knowing DTD
    SQL is all that one needs for writing an ATG
  • PRATA middleware based on ATGs
  • novel technique of combining partitioning and
    materialization
  • a practical solution for DTD-directed publishing
  • ATG the first systematic way for DTD-directed
    publishing

36
Outline
  • XML publishing
  • DTD-directed publishing VLDB02
  • Attribute Transformation Grammars
  • PRATA middleware based on ATGs
  • Schema-directed integration ongoing work
  • Coping with both type and integrity constraints
  • Attribute Integration Grammar (ATG)
  • Research problems

37
Data integration in XML
  • multiple, heterogeneous data sources
  • integrated data conforming to a schema DTD
    integrity constraints
  • existing systems the lack of schema-conformance

38
More about hospital and insurance company
  • 3 databases in the hospital
  • Patient (SSN, name, trId, policy, date)
  • Treatment (SSN, trId)
  • Treatment (trId, name)
  • Procedure (trId1, trId2)
  • Billing (trId, cost)
  • Account (SSN, date, amount)

39
Daily report in XML to insurance company
  • Given a particular date, report
  • information of all the patients on that date
  • SSN, name, policy
  • treatments grouped under each patient
  • description of the treatments of these patients
    hierarchy
  • cost of all these treatments
  • Remark
  • Multiple data sources
  • Information passing

40
XML report

date
report
patientInfo
trIdS1
. . .
treatment
patient
patient
. . .
trId
medication
SSN
. . .
trId
trId
DB1
unbounded
41
DTD predefined by the insurance company
  • report ? patientInfo,
    treatmentInfo, costInfo
  • patientInfo ? patient
  • patient ? SSN, pname,
    policy, medication
  • medication ? trId
  • treatmentInfo ? treatment
  • treatment ? trId, tname,
    procedure
  • procedure ? treatment
  • costInfo ? trCost
  • trCost ? trId, cost

42
Integrity constraints
  • Unary keys, foreign keys
  • Key trCost.trId ? trCost
  • each treatment is charged only once
  • Foreign key treatment.trId ? trCost.trId
  • every treatment has cost information reported
  • Why not XML-Schema constraints?
  • Consistency undecidable one cant determine
    whether a schema (DTD constraints) has a
    document instance PODS01
  • Other technical problems DEXA02

43
challenges
  • Multiple data sources
  • Integrity constraints
  • Data dependency controlled derivation
  • synthesized attributes bottom-up
  • inherited attributes top-down
  • set/bag-valued attributes

44
Outline
  • XML publishing
  • DTD-directed publishing VLDB02
  • Attribute Transformation Grammars
  • PRATA middleware based on ATGs
  • Schema-directed integration ongoing work
  • Coping with both type and integrity constraints
  • Attribute Integration Grammar (ATG)
  • Research problems

45
Attribute Integration Grammar (ATG)
  • Attributes associated with each element type e
  • Syn(e) synthesized from children
  • Inh(e) inherited from parent and siblings
  • Syn(e), Inh(e) tuple or set/bag-valued
  • Rules associated with each e ? ?, for e in ?
  • Syn(Parent) Syn(e) f(Syn(children),
    Inh(e))
  • Inh(child) Inh(e) Q(Inh(e),
    Syn(sibling) )

ATG
DTD


semantic rules
attributes
46
attributes
  • Root
  • Inh(report) (date) -- the parameter of the
    ATG
  • Syn(report) (tKey bag, trIdS set) for
    constraints
  • patientInfo
  • Inh(patientInfo) (date) --- from parent
  • Syn(patientInfo) (trIdS1 set) collection of
    trIds
  • treatmentInfo
  • Inh(treatmentInfo) (trIdS1 set) --- from
    sibling
  • Syn(treatmentInfo) (trIdS2 set) collection
    of trIds
  • costInfo
  • Inh(costInfo) (trIdS2 set) --- from sibling
  • Syn(costInfo) (tKey bag) collection of trIds

47
Dependency relation
  • element type e depends on e
  • iff Inh(e) Q ( Syn(e) )
  • e must be evaluated before e
  • requirement the dependency relation is acyclic
  • report ? patientInfo, treatmentInfo,
    costInfo
  • Inh(treatmentInfo).trIdS1 Syn(patientInfo).tr
    IdS1
  • Inh(costInfo).trIdS2 Syn(treatmentInfo).t
    rIdS2
  • Inh(patientInfo).date Syn(report).date

report
date
patientInfo
treatmentInfo
costInfo
trIdS1
trIdS2
trId
trId
trId
48
Synthesized attributes
  • report ? patientInfo, treatmentInfo,
    costInfo
  • Syn(report).trIdS Syn(treatmentInfo).trId
    S2
  • Syn(report).tKey Syn(costInfo).tKey

report
tKey
trIdS2
patientInfo
treatmentInfo
costInfo
trId
trId
trId
49
Inherited attributes of the first subtree
  • patientInfo ? patient
  • Inh(patient) ? Q1(DB1)
  • select SSN, name, policy
  • from Patient
  • where date Inh(patientInfo).date
  • patient ? SSN, pname, policy, medication
  • Inh(medication) Inh(patient).SSN
  • . . .
  • medication ? trId
  • Inh(trId) ? Q2(DB1)
  • select trId
  • from Treatment
  • where SSN Inh(medication).SSN

patientInfo
. . .
patient
patient
. . .
medication
. . .
trId
trId
DB1
50
Synthesized attributes of the first subtree
  • patientInfo ? patient
  • Syn(patientInfo).trIdS1 ? Syn(patient).trIdS
  • patient ? SSN, pname, policy, medication
  • Syn(patient).trIdS Syn(medication).trIdS
  • medication ? trId
  • Syn(medication).trIdS ? Syn(trId).trId

51
Inherited attributes of the second subtree
  • treatmentInfo ? treatment
  • Inh(treatment) ? Inh(treatmentInfo).trIdS1
  • treatment ? trId, tname, procedure
  • Inh(tname) ? Q3(DB2)
  • select name
  • from Treatment
  • where trId Inh(treatment).trId
  • . . .
  • procedure ? treatment
  • Inh(treatment) ? Q4(DB2)
  • select trId2
  • from Procedure
  • where trId1
    Inh(procedure).trId

treatmentInfo
treatment
treatment
. . .
. . .
procedure
treatment
treatment
. . .
trId
trId
DB2
52
Synthesized attributes of the second subtree
  • treatmentInfo ? treatment
  • Syn(treatmentInfo).trIdS2
  • Inh(treatmentInfo).trIdS1 ? ?Syn(treatment).trI
    dS
  • treatment ? trId, tname, procedure
  • Syn(treatment).trIdS Syn(procedure).trIdS
  • procedure ? treatment
  • Syn(procedure).trIdS ? Syn(treatment).trId

treatment
53
Inherited attributes of the third subtree
  • costInfo ? trCost
  • Inh(trCost) ? Inh(costInfo).trIdS2
  • trCost ? trId, cost
  • Inh(cost) ? Q3(DB2)
  • select cost
  • from Billing
  • where trId Inh(trCost).trId
  • . . .

costInfo
. . .
trCost
trCost
cost
trId
DB3
54
Inherited attributes of the third subtree
  • costInfo ? trCost
  • Syn(costInfo).tKey ? Syn(trCost).tKey
    -- bag union
  • trCost ? trId, cost
  • Syn(trCost).tKey Inh(trCost).trId
    -- cast a bag to a set

55
Controlled derivation
  • data dependency construction of one part of a
    document depends on other parts
  • production refinement n trIdS2
  • costInfo ? trCost ? costInfo ?
    trCostn
  • Data driven ATGregular tree grammar
    ATGDTD

56
Constraint compilation
  • Key trCost.trId ? trCost
  • unique (report.tKey) -- no
    duplicates in the bag
  • Foreign key treatment.trId ? trCost.trId
  • report.trIdS ? set (report.tKey)
  • recall report ? patientInfo,
    treatmentInfo, costInfo
  • Syn(report).trIdS Syn(treatmentInfo).t
    rIdS2
  • Syn(report).tKey Syn(costInfo).tKey
  • Semantic rules and attributes related to
    constraints are automatically generated --
    compilation

57
Summary ATG
  • ATG the first system-directed XML integration
    mechanism
  • integrating multiple data sources
  • capturing both types and constraints in a uniform
    way
  • controlled derivation
  • constraint compilation
  • relationally complete on multiple data sources
  • lightweight no need for explicit wrappers

58
Outline
  • XML publishing
  • DTD-directed publishing VLDB02
  • Attribute Transformation Grammars
  • PRATA middleware based on ATGs
  • Schema-directed integration ongoing work
  • Coping with both type and integrity constraints
  • Attribute Integration Grammar (ATG)
  • Research problems

59
Schema-directed XML to XML transformation
  • Mapping between XML documents
  • XTG XML Transformation Grammar
  • DTD augmented with XQuery
  • Constraint-compilation

XTG
60
Capturing more expressive integrity constraints
  • Expressive constraints multi-attribute keys,
    foreign keys, inverse constraints, and
    non-classical ones PODS02,JACM, JCSS,TOCL
  • Difficulty consistency analyses undecidable
    PODS01

DTD
XML ATG
XML
DB
DB
Web
61
Combining XML query and view
  • Implementing XQeury with SQL and ATG
  • plausible ATG supports recursion
  • motivation XML data is often stored in DBMS
  • Open question
  • what fragment of XQuery ATG SQL?
  • incremental evaluation?

XQuery
XML ATG
SQL
data
DB
62
Updating XML via relational DBMS
update
  • mapping updates to the underlining DBMS via ATG
  • motivation making use of DBMS transaction
    control
  • incremental maintenance
  • constraint propagation XML constraints --gt
    relations
  • normalizing the relational storage of XML data
  • consistency maintenance

DTD
XML ATG
SQL
data
functional dependencies
DB
63
Ontology integration
  • Ontology a combination of
  • definition hierarchy
  • inheritance hierarchy
  • Integration
  • schema mapping information preservation
  • data mapping
  • Applications
  • semantic Web
  • scientific databases

64
Summary
  • ATG the first DTD-directed publishing mechanism
  • ATG the first system-directed XML integration
    mechanism
  • Research topics
  • many related questions remain open
  • these problems are hard yet important
  • the idea of ATG/ATG shed light on solving these
    problems
  • ? research projects, papers, systems, . . .
Write a Comment
User Comments (0)
About PowerShow.com