Title: XML Data Quality Modeling
1 XML Data Quality Modeling
- Monica Scannapieco
- Dipartimento di Informatica e Sistemistica
- Università di Roma La Sapienza
2 Outline
- Motivations
- The D2Q data model
- Querying the D2Q data model
- DaQuinCIS architecture: Mediator
- Conclusions
3 Data Quality: Multi-dimensional concept
- Accuracy
- Jhn vs. John
- Currency
- Residence address outdated vs. up to date
- Consistency
- ZIP code and city consistent
- Completeness
4 Data Quality and CISs
- Cooperative Information Systems (CISs)
- data sharing to accomplish cooperative tasks
- high data replication
- Instance-level heterogeneities need to be reconciled
- CISs need data quality
- High data replication
- Data quality needs CISs
5 Why DQ modeling?
- Quality can be associated with data in order to:
- Certify the correctness (accuracy, consistency, currency) and completeness of data
- benefits for cooperation
- Support instance-level reconciliation
- a last-update timestamp, used as a currency guarantee, drives the reconciliation phase
6 Why XML modeling?
- The context of CISs requires facing interoperability issues
- Flexibility of semi-structured models
- possibility of associating quality values with different data granularity levels
7 D2Q: Data and Data Quality Model
- Graph-based data model that enhances the semantics of the XML data model to represent quality data
8 Data Schema
- Data class d(name_d, p_1, ..., p_n)
- Name name_d
- Set of properties p_i = <name_i : Type_i>, where
- name_i is the name of the property p_i
- Type_i can be
- (i) a basic type,
- (ii) a data class, or
- (iii) a type set-of <X>, where <X> can be either a basic type or a data class
- Data Schema
- Node- and edge-labelled directed acyclic graph of data classes
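The definition above can be made concrete with a small sketch. It is only an illustration under stated assumptions: the class names `DataClass` and `SetOf`, the helper functions, and the property names of the example schema are hypothetical and are not part of D2Q itself. A property type is written as a string naming either a basic type or another data class, or as `SetOf(X)`.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass(frozen=True)
class SetOf:
    """The type set-of <X>; element names a basic type or a data class."""
    element: str


@dataclass
class DataClass:
    """A data class: a name plus a mapping property name -> type."""
    name: str
    properties: Dict[str, Union[str, SetOf]] = field(default_factory=dict)


def referenced_classes(cls, schema):
    """Names of the data classes that the properties of cls point to."""
    targets = []
    for prop_type in cls.properties.values():
        target = prop_type.element if isinstance(prop_type, SetOf) else prop_type
        if target in schema:  # anything not in the schema is treated as a basic type
            targets.append(target)
    return targets


def is_acyclic(schema):
    """Depth-first search: True if no data class can reach itself."""
    state = {}  # class name -> "visiting" or "done"

    def visit(name):
        if state.get(name) == "done":
            return True
        if state.get(name) == "visiting":
            return False  # back edge, so the graph has a cycle
        state[name] = "visiting"
        ok = all(visit(t) for t in referenced_classes(schema[name], schema))
        state[name] = "done"
        return ok

    return all(visit(name) for name in schema)


# Illustrative schema (property names are assumptions).
owner = DataClass("Owner", {"Name": "string", "FiscalCode": "string", "Address": "string"})
enterprise = DataClass("Enterprise", {"Name": "string", "Code": "string", "Owners": SetOf("Owner")})
schema = {c.name: c for c in (enterprise, owner)}
print(is_acyclic(schema))
```

The acyclicity check mirrors the requirement that a data schema is a directed acyclic graph of data classes; edge labels correspond to the property names.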
9 Quality Schema
- Quality class ?d associated with a data class d
[Figure: example quality schema. Quality classes Enterprise_Quality (EnterpriseQualityClass), Owner_Quality (OwnerQualityClass) and FiscalCode_Quality (FiscalCodeQualityClass), with dimension nodes such as Enterprise_Accuracy, Enterprise_Completeness, Enterprise_Currency, Enterprise_Consistency, Owner_Accuracy, FiscalCode_Accuracy and FiscalCode_Currency, typed t_accuracy, t_completeness, t_currency and t_consistency.]
10 D2Q Schema
[Figure: example D2Q schema. Data nodes Enterprise, Owner, Code and Name (with string-typed leaves) are linked by quality association edges, labelled "quality", to Enterprise_Quality, Owner_Quality, Code_Quality and Name_Quality, which carry the accuracy nodes Enterprise_accuracy, Code_accuracy and Name_accuracy.]
- Quality Associations: biunivocal (one-to-one) functions between all nodes of a data schema and all non-leaf nodes of a quality schema
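As a rough illustration of the biunivocal requirement, the following sketch checks that a candidate association is a one-to-one correspondence between the data nodes and the non-leaf quality nodes. The function name and the dictionary representation are assumptions made for this example; the node names are taken from the schema figure.

```python
def is_quality_association(data_nodes, quality_nonleaf_nodes, association):
    """True if association maps the data nodes one-to-one onto the non-leaf quality nodes."""
    total = set(association) == set(data_nodes)                        # every data node is mapped
    injective = len(set(association.values())) == len(association)     # no quality node is shared
    onto = set(association.values()) == set(quality_nonleaf_nodes)     # every quality node is used
    return total and injective and onto


data_nodes = ["Enterprise", "Owner", "Code", "Name"]
quality_nodes = ["Enterprise_Quality", "Owner_Quality", "Code_Quality", "Name_Quality"]
association = {
    "Enterprise": "Enterprise_Quality",
    "Owner": "Owner_Quality",
    "Code": "Code_Quality",
    "Name": "Name_Quality",
}
print(is_quality_association(data_nodes, quality_nodes, association))
```

Leaf nodes of the quality schema (the dimension values) are deliberately left out of the check, since the association only involves non-leaf quality nodes.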
11 D2Q Schema Instances
- Data Class Instances -> Data Objects
- Quality Class Instances -> Quality Objects
- Quality Association Values -> Quality Links
- Example data object Owner1
- FiscalCode: SCNMNCXXX
- Address: Via Salaria 113 Roma
- Name: Monica Scannapieco
12 From D2Q to XML
- D2Q schemas translated into XML Schemas
- XML Schema Types Definition
- Data and quality classes and their properties as XML elements
- OID and QOID attributes for quality associations
- Introduction of root elements
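A minimal sketch of what such an XML encoding of instances might look like, built with Python's standard `xml.etree.ElementTree`. The exact layout is an assumption: the `Qualities` container element, the `OID` values and the helper `add_with_quality` are hypothetical, while the `qOID` values and the `*_Quality`, `*_Accuracy` and `*_Currency` element names follow the example used in these slides.

```python
import xml.etree.ElementTree as ET


def add_with_quality(data_parent, quality_parent, tag, text, oid, qoid, dimensions):
    """Append a data element and its quality element, linked through the same qOID."""
    data_el = ET.SubElement(data_parent, tag, {"OID": oid, "qOID": qoid})
    data_el.text = text
    quality_el = ET.SubElement(quality_parent, f"{tag}_Quality", {"qOID": qoid})
    for dimension, value in dimensions.items():
        ET.SubElement(quality_el, f"{tag}_{dimension}").text = value
    return data_el, quality_el


root = ET.Element("root")                       # introduced root element
owner = ET.SubElement(root, "Owner", {"OID": "OID1"})
qualities = ET.SubElement(root, "Qualities")    # hypothetical container for quality objects

add_with_quality(owner, qualities, "Address", "Via Salaria 113 Roma",
                 "OID2", "qOID132", {"Accuracy": "high", "Currency": "medium"})
add_with_quality(owner, qualities, "FiscalCode", "SCNMNCXXX",
                 "OID3", "qOID131", {"Accuracy": "high"})

print(ET.tostring(root, encoding="unicode"))
```

The point of the sketch is only that a quality association becomes a shared identifier carried as an attribute, so data and quality elements can live in separate subtrees.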
13 Querying the D2Q Model with XQuery
- Quality Selectors: a set of user-defined XQuery functions
- Each quality selector gives access to the values of a specific dimension, or to the overall quality, of a set of input nodes
- accuracy(node) -> node
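The behaviour of a quality selector can be approximated outside XQuery. The sketch below is a Python analogue, not the actual XQuery functions: it assumes that a data element and its quality element share a `qOID` attribute and that quality elements are named `<tag>_Quality` with children `<tag>_<Dimension>`. The document layout and the `OID` values are illustrative.

```python
import xml.etree.ElementTree as ET

DOC = """
<root>
  <Owner OID="OID1">
    <Address OID="OID2" qOID="qOID132">Via Salaria 113 Roma</Address>
    <FiscalCode OID="OID3" qOID="qOID131">SCNMNCXXX</FiscalCode>
  </Owner>
  <Address_Quality qOID="qOID132">
    <Address_Accuracy>high</Address_Accuracy>
    <Address_Currency>medium</Address_Currency>
  </Address_Quality>
  <FiscalCode_Quality qOID="qOID131">
    <FiscalCode_Accuracy>high</FiscalCode_Accuracy>
  </FiscalCode_Quality>
</root>
"""


def quality(root, node):
    """Overall quality: the <tag>_Quality element sharing the node's qOID, or None."""
    qoid = node.get("qOID")
    for candidate in root.iter(f"{node.tag}_Quality"):
        if candidate.get("qOID") == qoid:
            return candidate
    return None


def accuracy(root, node):
    """Single-dimension selector: the accuracy node of the given data node, or None."""
    q = quality(root, node)
    return None if q is None else q.find(f"{node.tag}_Accuracy")


root = ET.fromstring(DOC)
address = root.find("Owner/Address")
print(accuracy(root, address).text)
```

Selectors for the other dimensions would follow the same pattern, which is why each selector maps nodes to nodes rather than to plain values.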
14 Example of DQ Accessing
- Query
    for $i in input()//owner[Name eq "Monica Scannapieco"]
    return quality($i/Address, $i/FiscalCode)
- Result
    <root>
      <Address_Quality qOID="qOID132">
        <Address_Accuracy>high</Address_Accuracy>
        <Address_Currency>medium</Address_Currency>
      </Address_Quality>
      <FiscalCode_Quality qOID="qOID131">
        <FiscalCode_Accuracy>high</FiscalCode_Accuracy>
      </FiscalCode_Quality>
    </root>
15 DaQuinCIS Platform: A platform for exchanging and improving data quality in CISs
- Quality Factory
- Quality Notification Service
- Data Quality Broker
- Record Matcher
16 Quality Improvement Strategy
- The Broker selects the best-quality data answering a query and sends it to the requester (data quality-driven query answering) and to other providers (On-Line Improvement)
- The Record Matcher periodically compares exported data in order to improve their quality
- The notification service multicasts data quality changes (Quality Maintenance)
[Figure: the Broker and the cooperative data of the organizations.]
17 Data Quality Broker: a Quality-based Data Integration System
- Wrapper/Mediator Architecture
- Global and local views expressed as D2Q-compliant XML Schemas
- Global-as-View (GAV) Mapping
18 Query Processing Steps (1)
- Given a query Q on a D2Q global schema
- Q is unfolded according to a static mapping that retrieves all copies of the same data that are available in the CIS
- The execution of the local queries returns a set of results, on which run-time matching is performed
19 Query Processing Steps (2)
- The result to be returned is built as follows
- (i) if no quality requirement is specified, a best-quality default semantics is adopted: the result is constructed by selecting the best-quality values
- (ii) if quality requirements are specified, the result is constructed by checking the satisfiability of the requirements on the whole result
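The two cases above can be sketched as follows. This is a simplified illustration under several assumptions that do not come from the slides: quality levels are taken to be on an ordinal scale low < medium < high, the "best" copy is the one with the highest sum of ranks, a missing dimension counts as the lowest level, and requirements are minimum levels per dimension checked copy by copy.

```python
RANK = {"low": 0, "medium": 1, "high": 2}  # assumed ordinal scale


def score(quality):
    """Aggregate quality of one copy: sum of the ranks of its dimensions (an assumption)."""
    return sum(RANK[level] for level in quality.values())


def satisfies(quality, requirements):
    """True if every required dimension reaches at least the requested level."""
    return all(RANK[quality.get(dim, "low")] >= RANK[minimum]
               for dim, minimum in requirements.items())


def build_result(copies, requirements=None):
    """copies: list of (value, {dimension: level}) pairs for the same data item."""
    if not copies:
        return []
    if not requirements:
        # (i) best-quality default semantics
        return [max(copies, key=lambda copy: score(copy[1]))]
    # (ii) keep only the copies that satisfy the quality requirements
    return [copy for copy in copies if satisfies(copy[1], requirements)]


copies = [
    ("Via Salaria 113 Roma", {"accuracy": "high", "currency": "medium"}),
    ("Via Salaria", {"accuracy": "medium", "currency": "medium"}),
]
print(build_result(copies))
print(build_result(copies, {"accuracy": "high"}))
```

An empty list in case (ii) signals that no available copy satisfies the requirements.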
20 Why such a semantics?
- Best-quality copies always available
- Quality Improvement Feature
- Results collected at query time have an associated quality
- The best-quality result is proposed to all data sources that provided a lower-quality result
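The improvement feature can be pictured with a short sketch: given the results collected at query time, find the best one and the sources that should receive it as a proposal. The numeric quality score, the source names and the function name are assumptions made for the example.

```python
def improvement_feedback(results):
    """results: {source: (value, score)}.

    Returns the best-quality value and the sorted list of sources whose
    own result had a strictly lower score, i.e. the sources to notify.
    """
    best_source = max(results, key=lambda source: results[source][1])
    best_value, best_score = results[best_source]
    lagging = sorted(source for source, (_, s) in results.items() if s < best_score)
    return best_value, lagging


results = {
    "Org1": ("Via Salaria 113 Roma", 3),
    "Org2": ("Via Salaria", 1),
    "Org3": ("Via Salaria 113 Roma", 3),
}
best_value, to_notify = improvement_feedback(results)
print(best_value, to_notify)
```

Sources that already hold a copy of equal quality are not notified, so feedback traffic is limited to the copies that can actually be improved.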
21 How does it work?
- Static mapping specified through path expressions
- A path expression locates a concept in a schema
- XML schemas are D2Q compliant
22 Mediator query processing steps
23 Unfolding
- Path Expression Extraction: a global query is analyzed to extract path expressions
- Path Expression Pre-processing: to obtain completely specified path expressions
- Translation: each path expression on the global view is translated into (a set of) path expressions over the structure of the local views
- Framing: keeps track of the transformation steps
- Queries over local sources are sent to the Transport Engine module
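The unfolding steps can be illustrated with a simplified sketch. It assumes that path expressions are already completely specified absolute paths (so the pre-processing step is skipped), that the static mapping is a plain dictionary from global paths to local paths, and that the frame is a list of tuples. The mapping contents, local path names and source names are hypothetical.

```python
import re

# Hypothetical static mapping: global path -> [(source, local path), ...]
MAPPING = {
    "/Enterprise/Owner/Address": [
        ("Org1", "/Company/Holder/Address"),
        ("Org2", "/Firm/Owner/Addr"),
    ],
    "/Enterprise/Owner/FiscalCode": [
        ("Org1", "/Company/Holder/FiscalCode"),
    ],
}

# Matches absolute paths such as /A/B/C (a deliberately naive pattern).
PATH = re.compile(r"(?:/[A-Za-z_][\w.-]*)+")


def extract_paths(query):
    """Path expression extraction: unique paths in order of appearance."""
    return list(dict.fromkeys(PATH.findall(query)))


def unfold(query, mapping):
    """Translate global paths into local ones and record each step in a frame."""
    local_queries = {}  # source -> list of local path expressions
    frame = []          # (global path, source, local path), used later for refolding
    for global_path in extract_paths(query):
        for source, local_path in mapping.get(global_path, []):
            local_queries.setdefault(source, []).append(local_path)
            frame.append((global_path, source, local_path))
    return local_queries, frame


query = ("for $a in /Enterprise/Owner/Address, "
         "$f in /Enterprise/Owner/FiscalCode return ($a, $f)")
local_queries, frame = unfold(query, MAPPING)
print(local_queries)
```

The frame is what makes the later re-translation possible, since it records which local path stands for which global path at each source.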
24 Refolding
- Re-translation: received results are re-translated according to the global schema specification
- Materialization: results are concatenated into a single, temporary file
- Global Query Execution: the global query is changed into a query using only local files, and can then be executed
- Record Matching: records are matched and, after a comparison of quality values made externally by the Comparator Module, the query results, ordered by quality, are sent for Quality Filtering
- Results best fitting the user query requirements are sent back to the user. Moreover, quality feedback is sent to the Transport Engine, which is in charge of propagating it through the system.
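A compact sketch of the materialization, record matching, ordering and filtering steps follows. It is a stand-in only: real record matching is approximate, whereas here records are matched by exact equality on a key field, and the quality comparison is reduced to a numeric score. The record layout, the `min_score` parameter and the source names are assumptions; re-translation and global query execution are omitted.

```python
from collections import defaultdict


def refold(local_results, key, min_score=0):
    """local_results: {source: [{"data": {...}, "score": number}, ...]}.

    Returns {key value: matched records ordered by decreasing score},
    keeping only records whose score is at least min_score.
    """
    # Materialization: concatenate all results, remembering their source.
    materialized = [dict(record, source=source)
                    for source, records in local_results.items()
                    for record in records]

    # Record matching (exact-key stand-in): group copies of the same entity.
    groups = defaultdict(list)
    for record in materialized:
        groups[record["data"][key]].append(record)

    # Ordering by quality, then quality filtering.
    answer = {}
    for value, matches in groups.items():
        ranked = sorted(matches, key=lambda r: r["score"], reverse=True)
        kept = [r for r in ranked if r["score"] >= min_score]
        if kept:
            answer[value] = kept
    return answer


local_results = {
    "Org2": [{"data": {"FiscalCode": "SCNMNCXXX", "Address": "Via Salaria"}, "score": 1}],
    "Org1": [{"data": {"FiscalCode": "SCNMNCXXX", "Address": "Via Salaria 113 Roma"}, "score": 3}],
}
answer = refold(local_results, "FiscalCode")
print([r["source"] for r in answer["SCNMNCXXX"]])
```

The records that fall below the best score in each group are the natural candidates for the quality feedback mentioned in the last step.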
25 Implementation Modules
26 On-going Experiments
- Currently comparing
- Periodic record matching in a traditional setting
- Quality Improvement strategy underlying DaQuinCIS
- Three Italian PA databases are being used
27 Conclusions
- An XML Data Quality model
- How to use it within a data integration (DI) architecture for:
- Quality accessing
- Quality improving
28 DaQuinCIS Platform
- Data Quality in Cooperative Information Systems (DaQuinCIS): a platform for exchanging and improving data quality in CISs
[Figure: DaQuinCIS architecture, showing organizations Org1, Org2 and OrgN, each with a Cooperative Gateway and back-end systems (internals of the organization), together with the Communication Infrastructure, the Rating Service, the Quality Notification Service (QNS), the Quality Factory (QF) and the Data Quality Broker (DQB).]