1 Designing and Building Warehouses in XML
Omar Boussaid, Riadh Ben Messaoud, Rémy Choquet, Stéphane Anthoard
Laboratory ERIC, University Lyon 2
Campus Porte des Alpes, 69676 Bron Cedex
Omar.Boussaid_at_univ-lyon2.fr - rbenmessaoud_at_ericuniv-lyon2.fr
remy.choquestephanea_at_gmail.com
http://eric.univ-lyon2.fr/
ADBIS 2006 - Thessaloniki, Hellas
2 Motivation
Complex data: different formats, different media
Example: case study of a patient (general information on the patient such as age, sex, etc.; scanner images; interviews in the form of sound recordings; handwritten doctors' reports)
→ It is necessary to structure the data and to homogenize them
→ XML: semi-structured organization of data. Its capacity for self-description and its tree structure give this formalism great flexibility and sufficient power to describe complex, heterogeneous and distributed data
3 Motivation
4 Outline
- Motivation
- Related work
- Our approach: X-Warehousing
- Formalization
- Construction of an XML cube
- Implementation
- Case study
- Conclusion and future directions
5 Related work
- Baril and Bellahsène (2000): Dawax, View Manager
- Pokorny (2001): XML star schema
- Golfarelli et al. (2001): Dimensional Fact Model, attribute trees
- Hümmer et al. (2003): XCube
- Rajugan et al. (2003): use of UML packages
- Trujillo et al. (2004): object-oriented approach
- Nassis et al. (2004): OO approach, xFacts repository and virtual dimensions
- Rusu et al. (2005): XML warehouse
- Park et al. (2005): XML-OLAP
6 Related work
- Two different approaches
- 1) Physical storage of XML documents in the DW.
- XML documents feed the DW.
- XML is regarded as an effective technology for supporting data.
- Data are slightly structured, suited to interoperability and to the exchange of information.
- 2) Use of the XML formalism to design the DW.
- → According to traditional multidimensional models such as the star schema or the snowflake schema.
7 Approach: X-Warehousing
General context of our approach
8 Formalization
- Definition: Star XML schema
- Let $(F, D)$ be a star schema, where
- $F$ is a set of facts having $m$ measures $\{F.M_q,\ 1 \le q \le m\}$, and
- $D = \{D_s,\ 1 \le s \le r\}$ is a set of $r$ dimensions, where each $D_s$ contains a set of $n_s$ attributes $\{D_s.A_i,\ 1 \le i \le n_s\}$.
- The star XML schema of $(F, D)$ is an XML schema such that
- $F$ defines the root element of the XML schema;
- $\forall q \in \{1, \dots, m\}$, $F.M_q$ defines an XML attribute included in the root element;
- $\forall s \in \{1, \dots, r\}$, $D_s$ defines an XML sub-element of the root element. There are as many XML sub-elements as there are dimensions linked to the set of facts $F$;
- $\forall s \in \{1, \dots, r\}$ and $\forall i \in \{1, \dots, n_s\}$, $D_s.A_i$ defines an XML attribute included in the XML element $D_s$.
9 Description of a cube by an XML schema
<xs:element name="F">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="D1" type="D1_Type" />
      <xs:element name="D2" type="D2_Type" />
      <xs:element name="D3" type="D3_Type" />
      <xs:element name="D3" type="D3_Type" />
      <xs:element name="D4" type="D4_Type" />
    </xs:sequence>
    <xs:attribute name="F.M1" type="xs:integer" />
    <xs:attribute name="F.M2" type="xs:integer" />
  </xs:complexType>
</xs:element>
<xs:complexType name="D1_Type">
  <xs:attribute name="D1.A1" type="xs:string" />
</xs:complexType>
<xs:complexType name="D2_Type">
  <xs:attribute name="D2.A1" type="xs:string" />
  <xs:attribute name="D2.A2" type="xs:string" />
</xs:complexType>
<xs:complexType name="D3_Type">
  <xs:attribute name="D3.A1" type="xs:string" />
  <xs:attribute name="D3.A2" type="xs:string" />
</xs:complexType>
<xs:complexType name="D4_Type">
  <xs:attribute name="D4.A1" type="xs:string" />
</xs:complexType>
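The mapping from a star schema (F, D) to an XML schema of this shape can be sketched in a few lines of Python. This is only an illustration and not the authors' Java application: the function name `star_to_xsd` is hypothetical, and typing every measure as `xs:integer` and every dimension attribute as `xs:string` is an assumption borrowed from the example on this slide.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
ET.register_namespace("xs", XS)


def star_to_xsd(fact, measures, dimensions):
    """Build an XML Schema tree for a star schema (F, D).

    fact       -- name of the fact F (becomes the root element)
    measures   -- list of measure names F.Mq (become root attributes)
    dimensions -- dict mapping each dimension Ds to its attribute names Ds.Ai
    """
    schema = ET.Element(f"{{{XS}}}schema")
    root = ET.SubElement(schema, f"{{{XS}}}element", name=fact)
    root_type = ET.SubElement(root, f"{{{XS}}}complexType")
    sequence = ET.SubElement(root_type, f"{{{XS}}}sequence")
    # Each dimension Ds is a sub-element of the root element.
    for dim in dimensions:
        ET.SubElement(sequence, f"{{{XS}}}element", name=dim, type=f"{dim}_Type")
    # Each measure F.Mq is an attribute of the root element (assumed integer).
    for measure in measures:
        ET.SubElement(root_type, f"{{{XS}}}attribute", name=measure, type="xs:integer")
    # Each Ds.Ai is an attribute of the element Ds (assumed string).
    for dim, attributes in dimensions.items():
        dim_type = ET.SubElement(schema, f"{{{XS}}}complexType", name=f"{dim}_Type")
        for attr in attributes:
            ET.SubElement(dim_type, f"{{{XS}}}attribute", name=attr, type="xs:string")
    return schema


schema = star_to_xsd(
    "F",
    ["F.M1", "F.M2"],
    {"D1": ["D1.A1"], "D2": ["D2.A1", "D2.A2"]},
)
print(ET.tostring(schema, encoding="unicode"))
```

The three loops follow the three clauses of the definition one by one, which makes it easy to check the generated schema against the formalization.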
10 Formalization
Definition: XML hierarchical dimension
Let $H = \{D_1, \dots, D_t, \dots, D_l\}$ be a hierarchical dimension. The XML hierarchical dimension is a part of an XML schema such that
- $D_1$ defines an XML element;
- $\forall t \in \{2, \dots, l\}$, $D_t$ defines an XML sub-element of the XML element $D_{t-1}$;
- $\forall t \in \{1, \dots, l\}$, each attribute in $D_t$ defines an XML attribute included in the XML element $D_t$.
Definition: XML snowflake schema
Let $(F, H)$ be a snowflake schema, where $F$ is a set of facts having $m$ measures $\{F.M_q,\ 1 \le q \le m\}$ and $H = \{H_s,\ 1 \le s \le r\}$ is a set of $r$ independent hierarchies. The XML snowflake schema of $(F, H)$ is an XML schema such that
- $F$ defines the root XML element of the XML schema;
- $\forall q \in \{1, \dots, m\}$, $F.M_q$ defines an XML attribute included in the root XML element;
- $\forall s \in \{1, \dots, r\}$, $H_s$ defines as many XML hierarchical dimensions, as sub-elements of the root XML element, as the number of times it is linked to the set of facts $F$.
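The nesting rule of the XML hierarchical dimension (each level $D_t$ is a sub-element of $D_{t-1}$ and carries its own attributes) can be illustrated with a small Python sketch. The helper name `hierarchy_to_element` is hypothetical, and the sample values are borrowed from the date hierarchy of the example fact in this deck.

```python
import xml.etree.ElementTree as ET


def hierarchy_to_element(levels):
    """Nest the levels D1, ..., Dl of a hierarchical dimension.

    levels -- list of (element_name, attributes_dict), ordered from D1 to Dl.
    Returns the XML element for D1; each Dt is a sub-element of Dt-1.
    """
    top = None
    parent = None
    for name, attributes in levels:
        if parent is None:
            node = ET.Element(name, attributes)  # D1 defines an XML element
            top = node
        else:
            node = ET.SubElement(parent, name, attributes)  # Dt under Dt-1
        parent = node
    return top


date = hierarchy_to_element([
    ("Date_of_study", {"Date": "1998-06-04"}),
    ("Day", {"Day_name": "June 4, 1998"}),
    ("Month", {"Month_name": "June, 1998"}),
    ("Year", {"Year_name": "1998"}),
])
print(ET.tostring(date, encoding="unicode"))
```

In a snowflake schema, one such nested element would be attached under the root fact element for every link between a hierarchy and the facts.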
11 Example of an XML fact
<?xml version="1.0" encoding="UTF-8" ?>
<Suspicious_region Region_length="287" Number_of_regions="6">
  <Patient Patient_age="60">
    <Age_class Age_class_name="Between 60 and 69 years old" />
  </Patient>
  <Lesion_type Lesion_type_name="calcification type round_and_regular distribution n/a">
    <Lesion_category Lesion_category_name="calcification type round_and_regular" />
  </Lesion_type>
  <Assessment Assessment_code="2" />
  <Subtlety Subtlety_code="4" />
  <Pathology Pathology_name="benign_without_callback" />
  <Date_of_study Date="1998-06-04">
    <Day Day_name="June 4, 1998">
      <Month Month_name="June, 1998">
        <Year Year_name="1998" />
      </Month>
    </Day>
  </Date_of_study>
  <Date_of_digitization Date="1998-07-20">
    <Day Day_name="July 20, 1998">
      <Month Month_name="July, 1998">
        <Year Year_name="1998" />
      </Month>
    </Day>
  </Date_of_digitization>
  <Digitizer Digitizer_name="lumisys laser" />
  <Scanner_image Scanner_file_name="B_3162_1.RIGHT_CC.LJPEG" />
</Suspicious_region>
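To show how such a fact can be read back, here is a minimal Python sketch that extracts the measures (attributes of the root) and walks each dimension down its hierarchy levels. The function `read_fact` is hypothetical, the embedded document is an abridged copy of the fact above, and the sketch assumes that every hierarchy level has at most one sub-element.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the example fact (only one dimension is kept).
FACT = """<Suspicious_region Region_length="287" Number_of_regions="6">
  <Date_of_study Date="1998-06-04">
    <Day Day_name="June 4, 1998">
      <Month Month_name="June, 1998">
        <Year Year_name="1998" />
      </Month>
    </Day>
  </Date_of_study>
</Suspicious_region>"""


def read_fact(xml_text):
    """Return (measures, dimensions) for one XML fact."""
    root = ET.fromstring(xml_text)
    measures = dict(root.attrib)  # attributes of the root are the measures
    dimensions = {}
    for dim in root:  # each child of the root is a dimension
        levels = []
        node = dim
        while node is not None:
            levels.append((node.tag, dict(node.attrib)))
            # Assumption: at most one sub-element per hierarchy level.
            node = node[0] if len(node) else None
        dimensions[dim.tag] = levels
    return measures, dimensions


measures, dimensions = read_fact(FACT)
print(measures)
print([tag for tag, _ in dimensions["Date_of_study"]])
```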
12 Construction of the XML cubes
- MCA: needs for analysis
- XML documents
- Algorithms to merge attribute trees, based on
1. merging by pruning
2. merging by grafting
Concept of attribute tree (Golfarelli et al. 1998; Golfarelli and Rizzi 1999; Golfarelli et al. 2001)
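As a reminder of the two operations on attribute trees as they are usually described in the work of Golfarelli et al., pruning drops a vertex together with its whole subtree, while grafting drops a vertex but keeps its descendants by re-attaching them to its parent. The Python sketch below illustrates both; the `Node` class and the function names are hypothetical, and vertices are identified by name only, which is a simplifying assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)


def prune(node, name):
    """Pruning: drop the vertex `name` together with its whole subtree."""
    node.children = [c for c in node.children if c.name != name]
    for child in node.children:
        prune(child, name)


def graft(node, name):
    """Grafting: drop the vertex `name` but re-attach its children to its parent."""
    new_children = []
    for child in node.children:
        graft(child, name)  # handle deeper occurrences first
        if child.name == name:
            new_children.extend(child.children)
        else:
            new_children.append(child)
    node.children = new_children


def names(node):
    """Pre-order list of vertex names, handy for inspecting a tree."""
    return [node.name] + [n for c in node.children for n in names(c)]


tree = Node("Suspicious_region", [
    Node("Patient", [Node("Age_class")]),
    Node("Date_of_study", [Node("Day", [Node("Month", [Node("Year")])])]),
])
graft(tree, "Day")      # Month is now attached directly to Date_of_study
after_graft = names(tree)
prune(tree, "Patient")  # Patient and Age_class both disappear
after_prune = names(tree)
print(after_graft)
print(after_prune)
```

Neither function can remove the root itself, which matches the fact that the root of an attribute tree is the fact.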
13 Construction of the XML cubes
Merging of the attribute trees
MCA (needs for analysis) → attribute tree
XML documents → attribute tree
→ XML schema of an XML cube
- Operations for merging the attribute trees
- Concept of minimal content
14 Construction of the XML cubes
- Merging of attribute trees
15 Construction of the XML cubes
- Minimal content of an XML document
XML documents must contain sufficient information to meet the user's analysis needs: this is checked against the attribute tree.
The user specifies whether each element of the MCA (measures, dimensions, hierarchies and their attributes) is mandatory or optional for his analysis objectives.
The minimal content of an XML document thus corresponds to the mandatory part of the attribute tree associated with the MCA.
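A check of the minimal content could look like the following Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the function `has_minimal_content` is hypothetical, the mandatory part of the attribute tree is encoded as a flat list of (element path, attribute name) pairs, and the choice of which elements are mandatory in the example is invented for the demonstration.

```python
import xml.etree.ElementTree as ET


def has_minimal_content(xml_text, mandatory):
    """Check that a document holds every mandatory part of the attribute tree.

    mandatory -- list of (element_path, attribute_name) pairs; the path is
                 relative to the root ("." is the root itself) and the
                 attribute name may be None when only the element is required.
    """
    root = ET.fromstring(xml_text)
    for path, attribute in mandatory:
        element = root.find(path)
        if element is None:
            return False
        if attribute is not None and attribute not in element.attrib:
            return False
    return True


# Hypothetical mandatory part chosen by the user for this analysis.
MANDATORY = [
    (".", "Region_length"),
    ("Patient", "Patient_age"),
    ("Date_of_study/Day/Month/Year", "Year_name"),
]

complete = (
    '<Suspicious_region Region_length="287">'
    '<Patient Patient_age="60" />'
    '<Date_of_study Date="1998-06-04"><Day Day_name="June 4, 1998">'
    '<Month Month_name="June, 1998"><Year Year_name="1998" /></Month>'
    '</Day></Date_of_study>'
    '</Suspicious_region>'
)
incomplete = '<Suspicious_region Region_length="287"><Patient Patient_age="60" /></Suspicious_region>'

print(has_minimal_content(complete, MANDATORY))
print(has_minimal_content(incomplete, MANDATORY))
```

A document failing the check lacks part of the mandatory information, so it cannot serve the analysis objectives expressed in the MCA.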
16 Implementation
17 Implementation
Function WriteTreeDeep(document, tree)
  root = GetRootElement(document)
  nodeList = GetNodes(tree, root)
  While Not(end(nodeList))
    Graphe.AddVertex(nodeList.name)
    Call Function ReadTreeDeep(nodeList.name, tree)
  End While
End Function
Function ReadTreeDeep(root, tree)
  nodeList = GetNodes(tree, root)
  While Not(end(nodeList))
    Graphe.AddVertex(nodeList.name)
    Call Function ReadTreeDeep(nodeList.name, tree)
  End While
End Function
- Recursive functions WriteTreeDeep and ReadTreeDeep to handle the attribute tree
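The recursive traversal above can be re-expressed as a short runnable Python sketch. It is not the authors' Java code: the names `write_tree_deep` and `read_tree_deep` only mirror the pseudocode, the graph is reduced to two plain lists (vertices and arcs), and treating XML attributes as leaf vertices of their element is an assumption.

```python
import xml.etree.ElementTree as ET


def write_tree_deep(xml_text):
    """Collect the vertices and arcs of the attribute tree of a document."""
    root = ET.fromstring(xml_text)
    vertices, arcs = [root.tag], []
    read_tree_deep(root, vertices, arcs)
    return vertices, arcs


def read_tree_deep(element, vertices, arcs):
    """Recursively visit the sub-elements of `element`, depth first."""
    # Assumption: XML attributes become leaf vertices attached to their element.
    for attribute in element.attrib:
        vertices.append(attribute)
        arcs.append((element.tag, attribute))
    for child in element:
        vertices.append(child.tag)
        arcs.append((element.tag, child.tag))
        read_tree_deep(child, vertices, arcs)


doc = (
    '<Patient Patient_age="60">'
    '<Age_class Age_class_name="Between 60 and 69 years old" />'
    '</Patient>'
)
vertices, arcs = write_tree_deep(doc)
print(vertices)
print(arcs)
```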
18 Implementation
Function MergeTree(tree1, tree2)
  tree3 = DuplicateTree(tree1)
  While Not(end(nodeList(tree3)))
    vertex1 = GetVertex(tree3)
    While Not(end(nodeList(tree2)))
      vertex2 = GetVertex(tree2)
      If vertex2 = vertex1 Then vertex1.arc = 0
    End While
  End While
  Tree3 = WriteTree(tree3)
End Function
- Function MergeTree to merge two attribute trees
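The slide does not spell out the exact semantics of the merge, so the Python sketch below shows only one possible reading: a merge by pruning that keeps the vertices of the first tree that also occur in the second one. The function names, the nested-dictionary encoding of a tree and the matching of vertices by name alone are all assumptions made for the illustration.

```python
def vertex_names(tree):
    """Set of all vertex names in a nested-dict tree {name: subtree}."""
    found = set()
    for name, subtree in tree.items():
        found.add(name)
        found |= vertex_names(subtree)
    return found


def merge_by_pruning(tree1, tree2):
    """Return a copy of tree1 restricted to the vertices also found in tree2.

    A new tree is built, so tree1 is left untouched (the role played by
    DuplicateTree in the pseudocode).
    """
    allowed = vertex_names(tree2)

    def keep_common(tree):
        # A vertex missing from tree2 is dropped together with its subtree.
        return {
            name: keep_common(subtree)
            for name, subtree in tree.items()
            if name in allowed
        }

    return keep_common(tree1)


mca_tree = {"Suspicious_region": {"Patient": {"Age_class": {}}, "Pathology": {}, "Digitizer": {}}}
doc_tree = {"Suspicious_region": {"Patient": {}, "Pathology": {}}}
merged = merge_by_pruning(mca_tree, doc_tree)
print(merged)
```

A merge by grafting would differ only in what happens to the children of a dropped vertex: they would be re-attached to its parent instead of being discarded.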
19 Case study: Context
DDSM (Digital Database for Screening Mammography): a complex database (2,604 patient files; a total volume of 230.9 GB)
- A patient file is composed of
- 1 .ics file describing, in ASCII format, general information about the patient file.
- 4 .LJPEG (lossless JPEG) image files of the digitized radiographs.
- Each radiograph shows one view of a breast: Left_CC, Left_MLO, Right_CC, Right_MLO (CC: Cranio-Caudal; MLO: Medio-Lateral Oblique).
- For each radiograph showing one or more abnormal zones, there is an associated .OVERLAY file in ASCII format describing the anomaly of the breast.
- 1 .16_PGM image file gathering the 4 radiographs and giving a quick overview for viewing a patient file.
20 Case study: Context
21 Case study: XML corpus
XML documents (http://eric.univ-lyon2.fr/rbenmessaoud/?pagedonneessection3)
22 Case study: Conceptual model of the needs
Case of the suspicious regions
23 Case study: attribute trees
- Attribute tree associated with the MCA of the suspicious regions.
- Attribute tree of the input XML documents
24 Case study: Logical model of the XML cube
XML schema of the Suspicious regions cube
<?xml version="1.0" encoding="UTF-8" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns="http://www.w3schools.com">
  <xs:element name="Suspicious_region">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Patient" type="Patient_Type" />
        <xs:element name="Lesion_Type" type="Lesion_Type_Type" />
        <xs:element name="Subtlety" type="Subtlety_Type" />
        <xs:element name="Pathology" type="Pathology_Type" />
        <xs:element name="Date_of_study" type="Date_Type" />
        <xs:element name="Date_of_digitization" type="Date_Type" />
        <xs:element name="Digitizer" type="Digitizer_Type" />
        <xs:element name="Scanner_image" type="Scanner_Type" />
      </xs:sequence>
      <xs:attribute name="Region_length" type="xs:integer" />
      <xs:attribute name="Number_of_regions" type="xs:integer" />
    </xs:complexType>
  </xs:element>
  <xs:complexType name="Patient_Type">
    <xs:sequence>
      <xs:element name="Age_class">
        <xs:complexType>
          <xs:attribute name="Age_class_name" type="xs:string"/>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
25 Case study: Logical model of the XML cube
..
  <xs:complexType name="Lesion_Type_Type">
    <xs:sequence>
      <xs:element name="Lesion_category">
        <xs:complexType>
          <xs:attribute name="Lesion_category_name" type="xs:string"/>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
    <xs:attribute name="Lesion_type_name" type="xs:string"/>
  </xs:complexType>
  <xs:complexType name="Subtlety_Type">
    <xs:attribute name="Subtlety_code" type="xs:integer"/>
  </xs:complexType>
  <xs:complexType name="Pathology_Type">
    <xs:attribute name="Pathology_name" type="xs:string"/>
  </xs:complexType>
  <xs:complexType name="Digitizer_Type">
    <xs:attribute name="Digitizer_name" type="xs:string"/>
  </xs:complexType>
  <xs:complexType name="Scanner_Type">
    <xs:attribute name="Scanner_file_name" type="xs:string"/>
  </xs:complexType>
..
26 Case study: Logical model of the XML cube
XML schema of the Suspicious regions cube
  <xs:complexType name="Date_Type">
    <xs:sequence>
      <xs:element name="Day">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="Month">
              <xs:complexType>
                <xs:sequence>
                  <xs:element name="Year">
                    <xs:complexType>
                      <xs:attribute name="Year_name" type="xs:integer"/>
                    </xs:complexType>
                  </xs:element>
                </xs:sequence>
                <xs:attribute name="Month_name" type="xs:string"/>
              </xs:complexType>
            </xs:element>
          </xs:sequence>
          <xs:attribute name="Day_name" type="xs:string"/>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
    <xs:attribute name="Date" type="xs:date"/>
  </xs:complexType>
</xs:schema>
27 Conclusion and future directions
Conclusion
- A methodology based on the XML formalism to store complex data.
- To express a level of abstraction suited to preparing data for analysis.
- To feed a multidimensional structure using XML documents.
- A formalization of the star schema and the snowflake schema in XML.
- (use of the attribute tree, Golfarelli et al., 2001a,b)
- A Java application which produces a logical model and a physical model of a cube from heterogeneous XML documents
- A case study on suspicious regions in mammograms showed the value of our approach.
28 Conclusion and future directions
Future directions
- Querying the XML cube: an extension of the XQuery language is necessary to make this possible and to carry out the Group-by operation.
- Non-numerical measures: resort to suitable operators.
- Example: the OpAC operator (Ben Messaoud et al., 2004),
- A performance study is needed in the context of XML cubes
- The problem of updating XML cubes when the data sources change
- Physical model of the XML cube