Title: Metadata and XML for Organizing and Accessing Multiple Statistical Data Sources
1. Metadata and XML for Organizing and Accessing Multiple Statistical Data Sources
- Yaxin Bi (1), Fionn Murtagh (2), and Sally McClean (1)
- (1) University of Ulster, N. Ireland
- (2) The Queen's University of Belfast, N. Ireland
2. Introduction
- A brief background of this work
- Major issues related to the integration of multiple statistical data sources
- A proposed approach for querying multiple statistical data sources
- The roles of statistical metadata and XML in structuring and organizing statistical data sources
- Introduction to a content model DTD, called Statistical Metadata Description Language (SMDL)
3. Introduction (continued)
- Conclusion
- Implementation of a prototype
4. The issues in the integration of multiple statistical data sources
- Legacy system problem
- Heterogeneity
- Access to multiple data sources
- Sharing a common view among multiple data sources
5. A methodology used in this work
- Enhancing the conceptual representation of multiple data sources
- Considering relations among data sources
- Extracting metadata and building up a content model that serves as a global ontology for describing multiple data sources
- Accessing multiple data sources is divided into two steps
  - Searching, navigating
  - Database query
6. Querying multiple sources
(Diagram labels: Client; Query; Database1; Database2; Results)
7. Expressing a database query on multiple data sources
- SQL select-from-where construct as a basis
- Clear information need
- Unclear data sources
  - What do the databases contain?
  - Which variable names can be used for querying?
  - Where are the databases?
- Aim: make what is unclear about the data sources clear
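A minimal sketch of why the three questions above matter: a select-from-where statement cannot be written until the variable names (select), the dataset (from), and the database location (the connection) are known. The table `labour_force`, its columns, and the helper `build_select` are hypothetical and are not taken from the slides; an in-memory SQLite database stands in for one statistical data source.

```python
import sqlite3


def build_select(variables, dataset, condition=None):
    """Assemble a select-from-where statement from its three parts."""
    # Only plain identifiers are accepted for the dataset and variable names.
    for name in [dataset, *variables]:
        if not name.isidentifier():
            raise ValueError(f"not a plain identifier: {name!r}")
    sql = f"SELECT {', '.join(variables)} FROM {dataset}"
    if condition:
        # The condition is passed through unchanged; this is a sketch only.
        sql += f" WHERE {condition}"
    return sql


# Hypothetical stand-in for one statistical database ("where the database is").
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE labour_force (region TEXT, year INTEGER, employed INTEGER)"
)
conn.executemany(
    "INSERT INTO labour_force VALUES (?, ?, ?)",
    [("North", 1999, 120), ("South", 1999, 95), ("North", 2000, 130)],
)

# The information need is clear; the names below are what must be discovered.
sql = build_select(["region", "employed"], "labour_force", "year = 1999")
rows = conn.execute(sql + " ORDER BY region").fetchall()
print(sql)
print(rows)
```

Every argument to `build_select` is something the user has to find out before querying, which is the gap the metadata is meant to close.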
8. Conventional approach to querying multiple data sources
- An integrated interface (global view) for multiple data sources
  - limited, and
  - it is impossible to represent context
- Universal schema
  - must ensure mappings between schemas that are semantically equivalent but differ in form
9. Two steps of querying multiple sources
(Diagram labels: Searching / Navigating / Exploring; Query; Client; Database1; Database2; Results; XML data (metadata))
10. From unclear query to clear query
Vague expression: Query → MD
<Domain / Provider / Dataset / Variable Name / Description>
. . .
Clear expression: Query → DB, DB
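The step from a vague expression to a clear one can be sketched as a keyword search over metadata records, each holding one pathway of the form Domain / Provider / Dataset / Variable Name / Description. This is an illustrative sketch only: the `Pathway` class, the sample records, and the functions `search` and `to_clear_query` are assumptions and do not come from the ADDSIA prototype.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class Pathway:
    """One metadata record: Domain / Provider / Dataset / Variable / Description."""
    domain: str
    provider: str
    dataset: str
    variable: str
    description: str


# Hypothetical metadata records (dataset and variable names are invented).
METADATA = [
    Pathway("Labour Force", "ONS", "lfs_1999", "employed",
            "number of people in employment"),
    Pathway("Labour Force", "CSO", "qnhs_1999", "unemployed",
            "number of people out of work"),
    Pathway("Health", "ONS", "health_survey", "smoker",
            "whether the respondent smokes"),
]


def search(keywords, metadata):
    """Step 1: return every pathway whose text mentions any keyword."""
    hits = []
    for pathway in metadata:
        text = " ".join(astuple(pathway)).lower()
        if any(keyword.lower() in text for keyword in keywords):
            hits.append(pathway)
    return hits


def to_clear_query(pathway):
    """Step 2: turn a chosen pathway into a concrete database query."""
    return f"SELECT {pathway.variable} FROM {pathway.dataset}"


hits = search(["employment"], METADATA)
for hit in hits:
    print(hit.provider, "->", to_clear_query(hit))
```

The vague input is only a keyword; the clear output names the provider, the dataset, and the variable to query.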
11. Ranking and validating pathway
- Ranking the retrieved results
- Generating a pathway <Domain / Provider / Data sources / Variable Name / Description>
- Validating the pathway for database query
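The slides do not say how results are ranked or how a pathway is validated. The sketch below assumes one simple possibility: rank by the number of query terms a pathway mentions, and treat a pathway as valid when all five fields are present and the variable is listed for that provider and dataset. The sample pathways, the `catalogue` of known variables, and all function names are hypothetical.

```python
REQUIRED = ("domain", "provider", "dataset", "variable", "description")


def score(pathway, terms):
    """Count how many query terms occur anywhere in the pathway text."""
    text = " ".join(pathway.values()).lower()
    return sum(1 for term in terms if term.lower() in text)


def rank(pathways, terms):
    """Drop non-matching pathways and sort the rest by descending score."""
    scored = [(score(p, terms), p) for p in pathways]
    scored = [(s, p) for s, p in scored if s > 0]
    return [p for s, p in sorted(scored, key=lambda sp: sp[0], reverse=True)]


def validate(pathway, catalogue):
    """A pathway is usable only if complete and its variable is known."""
    if any(not pathway.get(field) for field in REQUIRED):
        return False
    known = catalogue.get((pathway["provider"], pathway["dataset"]), set())
    return pathway["variable"] in known


# Hypothetical retrieved results.
pathways = [
    {"domain": "Labour Force", "provider": "ONS", "dataset": "lfs_1999",
     "variable": "employed", "description": "people in employment"},
    {"domain": "Labour Force", "provider": "CSO", "dataset": "qnhs_1999",
     "variable": "hours",
     "description": "weekly hours worked in employment by region"},
    {"domain": "Health", "provider": "ONS", "dataset": "health_survey",
     "variable": "smoker", "description": "smoking status"},
]

# Hypothetical list of variables each database is known to expose.
catalogue = {
    ("ONS", "lfs_1999"): {"employed"},
    ("CSO", "qnhs_1999"): {"unemployed"},
}

ranked = rank(pathways, ["employment", "region"])
valid = [p for p in ranked if validate(p, catalogue)]
print([p["dataset"] for p in ranked])
print([p["dataset"] for p in valid])
```

Here the best-ranked pathway fails validation because its variable is not listed for that dataset, so only the second-ranked pathway would go forward to the database query.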
12. An architecture for ADDSIA
13. A proposed framework for statistical metadata and XML
- Creating a content model DTD with the form <identifier>value</identifier>
- Bringing data representation, structure and relationship together
- Describing the content of data sources and relations with external sources
- Crossing platforms
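As an illustration of the <identifier>value</identifier> form, the sketch below turns a nested Python dictionary into XML elements with the standard library's `xml.etree.ElementTree`. The helper `to_element` and the sample record are assumptions made for this example; only the element names are borrowed from the SMDL example in these slides.

```python
import xml.etree.ElementTree as ET


def to_element(identifier, value):
    """Build <identifier>value</identifier>, recursing into nested dicts."""
    element = ET.Element(identifier)
    if isinstance(value, dict):
        # A nested dict becomes child elements: structure and data together.
        for child_identifier, child_value in value.items():
            element.append(to_element(child_identifier, child_value))
    else:
        element.text = str(value)
    return element


# Hypothetical record; keys become identifiers, leaves become values.
record = {
    "DM_NAME": "Labour Force",
    "DM_PROVIDER": {"INSTITUTION": "ONS", "COUNTRY": "UK"},
}

xml_text = ET.tostring(to_element("DOMAIN", record), encoding="unicode")
print(xml_text)
```

Because the output is plain text, the same record can be read on any platform with an XML parser.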
14. Scope of statistical metadata
- Description of methods of statistical data processing and management
- Description of statistical data sources
- Definition of classification of activities and concept
- Definition of statistical population and measurement units
- Definition of variables
- Definition of mapping
15. Statistical metadata: domain metadata
- Domain metadata: what information is covered by the data sources
  - domain name
  - domain manager
  - domain institution
  - description
  - concept
  - data provider
  - source
16. Statistical metadata: data provider metadata
- Data provider metadata: details about the data sources, with a hierarchical structure
  - provider name
  - description
  - provider manager
  - data source
  - dataset
17. Statistical metadata: data provider metadata (continued)
- Dataset
  - name, survey details
  - timeperiod, time series
  - content, codelist
  - reference, variable
  - source
  - geographical area
  - note
18. Statistical Metadata Description Language (SMDL): ADDSIA DTD
<!ELEMENT SMDL (DOMAIN, DATAPROVIDER)>
<!ELEMENT DOMAIN (DMNAME, DMDESCRIPTION, DMINSTITUTION, DMMANAGER, DMPROVIDER, CONCEPT, CLASSIFICATION, UNITDESCR, DMSOURCE)>
<!ELEMENT DMNAME (#PCDATA)>
<!ELEMENT DMDESCRIPTION (#PCDATA)>
<!ELEMENT DMINSTITUTION (#PCDATA)>
<!ELEMENT DMMANAGER (NAME, EMAIL, TELEPHONE?, FAX?, ADDRESS?)>
<!ELEMENT NAME (#PCDATA)>
<!ELEMENT EMAIL (#PCDATA)>
<!ELEMENT TELEPHONE (#PCDATA)>
<!ELEMENT FAX (#PCDATA)>
<!ELEMENT ADDRESS (#PCDATA)>
<!ELEMENT DMPROVIDER (INSTITUTION, COUNTRY, URL?)>
<!ELEMENT INSTITUTION (#PCDATA)>
<!ELEMENT COUNTRY (#PCDATA)>
<!ELEMENT URL (#PCDATA)>
<!ELEMENT CONCEPT (NAME, DESCRIPTION, DEFINITION?, REFERENCE, STATUS)>
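To show what a content model such as DMMANAGER (NAME, EMAIL, TELEPHONE?, FAX?, ADDRESS?) constrains, the sketch below translates a flat, comma-separated model into a regular expression over child element names and checks an element against it. This is a simplified illustration, not a DTD validator: it handles only sequences of names with optional `?`, `*` or `+` suffixes, and the helper names are hypothetical. Python's `xml.etree.ElementTree` parses XML but does not validate against a DTD, hence the manual check.

```python
import re
import xml.etree.ElementTree as ET


def model_to_regex(model):
    """Translate 'NAME, EMAIL, TELEPHONE?' into a regex over child names."""
    parts = []
    for item in model.split(","):
        item = item.strip()
        suffix = item[-1] if item[-1] in "?*+" else ""
        name = item.rstrip("?*+")
        # Each child name is followed by one space so names cannot run together.
        parts.append(f"(?:{re.escape(name)} ){suffix}")
    return re.compile("".join(parts))


def children_match(element, model):
    """True if the element's children follow the given sequence model."""
    names = "".join(child.tag + " " for child in element)
    return model_to_regex(model).fullmatch(names) is not None


MANAGER_MODEL = "NAME, EMAIL, TELEPHONE?, FAX?, ADDRESS?"

good = ET.fromstring(
    "<DMMANAGER><NAME>A.N.Other</NAME><EMAIL>x</EMAIL>"
    "<TELEPHONE>ext 1234</TELEPHONE></DMMANAGER>"
)
bad = ET.fromstring("<DMMANAGER><EMAIL>x</EMAIL><NAME>A.N.Other</NAME></DMMANAGER>")

print(children_match(good, MANAGER_MODEL))
print(children_match(bad, MANAGER_MODEL))
```

The first element satisfies the model because the optional FAX and ADDRESS may be omitted; the second fails because NAME must come before EMAIL.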
19. SMDL tree structure
20. An example: marking up a labour force domain
<?xml version="1.0"?>
<!DOCTYPE SMDL SYSTEM "smdl_v_1.dtd">
<!-- Define a domain -->
<SMDL>
<DOMAIN>
  <DM_NAME> Labour Force </DM_NAME>
  <DM_DESCRIPTION> surveys and data sets studying the availability of labour. See xxx </DM_DESCRIPTION>
  <DM_INSTITUTION> ONS </DM_INSTITUTION>
  <DM_MANAGER>
    <NAME> A.N.Other </NAME>
    <EMAIL> A.Other@ons.gov.uk </EMAIL>
    <TELEPHONE> ext 1234 </TELEPHONE>
  </DM_MANAGER>
  <DM_PROVIDER>
    <INSTITUTION> ONS </INSTITUTION>
    <COUNTRY> UK </COUNTRY>
  </DM_PROVIDER>
  <DM_PROVIDER>
    <INSTITUTION> CSO </INSTITUTION>
    <COUNTRY> Ireland </COUNTRY>
  </DM_PROVIDER>
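A short sketch of how such a document could be read with Python's standard `xml.etree.ElementTree`. The slide's example stops before the closing tags, so the text below is a shortened copy that closes DOMAIN and SMDL itself; that completion, and the function name `providers`, are assumptions made for illustration. The DOCTYPE line is left out because ElementTree does not validate against a DTD.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the slide's example, closed here so that it parses.
SMDL_TEXT = """<SMDL>
  <DOMAIN>
    <DM_NAME> Labour Force </DM_NAME>
    <DM_INSTITUTION> ONS </DM_INSTITUTION>
    <DM_PROVIDER>
      <INSTITUTION> ONS </INSTITUTION>
      <COUNTRY> UK </COUNTRY>
    </DM_PROVIDER>
    <DM_PROVIDER>
      <INSTITUTION> CSO </INSTITUTION>
      <COUNTRY> Ireland </COUNTRY>
    </DM_PROVIDER>
  </DOMAIN>
</SMDL>"""


def providers(smdl_text):
    """Map each domain name to its (institution, country) providers."""
    root = ET.fromstring(smdl_text)
    result = {}
    for domain in root.iter("DOMAIN"):
        name = domain.findtext("DM_NAME", default="").strip()
        result[name] = [
            (
                provider.findtext("INSTITUTION", default="").strip(),
                provider.findtext("COUNTRY", default="").strip(),
            )
            for provider in domain.findall("DM_PROVIDER")
        ]
    return result


print(providers(SMDL_TEXT))
```

Navigating from a domain down to its providers in this way follows the hierarchical structure of the metadata.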
21. Conclusion
- Avoiding heterogeneity to some extent
- The framework combines an XML (field-based) format, a hierarchical structure and link-following relationships
- Providing an approach for scanning the content of multiple data sources through searching, navigating, and performing database queries
- There is significant overhead from delimiters in XML data
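The delimiter overhead can be made concrete with a rough measure: the share of characters in an XML string that are not element text. The function below is an illustrative sketch rather than a figure from this work; it counts tags and the whitespace between them as overhead, and the name `markup_overhead` is hypothetical.

```python
import xml.etree.ElementTree as ET


def markup_overhead(xml_text):
    """Fraction of characters that are markup or layout, not element text."""
    root = ET.fromstring(xml_text)
    # Text content with surrounding whitespace stripped counts as payload.
    content = sum(len(text.strip()) for text in root.itertext())
    total = len(xml_text)
    return (total - content) / total


sample = "<COUNTRY>UK</COUNTRY>"
overhead = markup_overhead(sample)
print(f"{overhead:.0%} of the characters are delimiters")
```

For this one-field sample, 19 of the 21 characters are tags, so short values wrapped in long identifiers carry the largest relative overhead.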
22. An architecture for searching and navigating statistical metadata
(Diagram labels: Browser; Client side; Navigating; Searching; HTTP/RMI; Domain server; XML data (metadata))
23. Retrieving metadata using a search engine
24. Navigating the hierarchical structure
25. Exploring dataset content for database query
26. Reading and editing XML data