Title: DIaD
1DIaD Data Integration and Dissemination
May 2009 James.Reid_at_ed.ac.uk
Data Integration and Dissemination DIaD
2Background
Who? EDINA: a JISC-funded National Data Centre
delivering online resources to UK Higher and
Further Education, and the ESRC's Geography Data Unit
for the Census Programme.
What? DIaD: an ESRC-funded project aimed at
exploring innovation in census delivery
mechanisms. The primary objective of this work is
to develop a data dissemination model which
demonstrates a more generic capability: that of
geo-linking.
3Background - What?
- The secondary objective of the work was to develop value-added services exploiting the results of the automated linkage outputs, specifically:
- A cartogram service
- A BitTorrent-based network for dissemination of the linked outputs
- More on these later... first the rationale...
4Background - Why?
- The two most heavily used of the data sources are the small area statistics provided by the Census Dissemination Unit (CDU) and the digital boundary datasets provided by the Geography Data Unit (UKBORDERS). Together these sources allow end users (significantly, researchers) to undertake a wide range of analytical and visualisation tasks, from, for example, simple choropleth mapping to cartogram transformations to detailed small area spatial analyses.
- Each resource (the statistics on one hand and the boundary data on the other) is extremely valuable in its own right, but in combination they provide a data resource of almost unparalleled versatility and richness to social science investigators.
5Background - why?
Evidence from a recent ESRC survey of geospatial
services and requirements. Source: ESRC interim
survey results, March 2009. n = 512
6How?
- Via Open Standards (a la Open Geospatial Consortium). Specifically using:
- the Geographic Linkage Service (GLS) Specification
- the Web Feature Service (WFS) Specification
- investigate the Web Processing Service (WPS) Specification
- Implicitly via use of Open Source Software
- An open standard is a standard that is publicly available and has various rights to use associated with it, and may also have various properties of how it was designed (e.g. open process).
7How?
Geographic Linkage Service (GLS) Specification.
Purpose: to provide a simple way
to describe and exchange data that contains
geographically related information, but which
does not include the detailed geometry of the
geographic object. A GLS provides a simple
standardized way to exchange attribute
information that applies to a well-known
geospatial dataset known as a Framework
dataset. Attribute information delivered from a
GLS can be used in a variety of ways, including
use by models to perform calculations, or
visualization as a web map.
8How?
Geographic Linkage Service (GLS) Specification.
GLS includes two related sets of operations.
1. GetData: attribute data is provided to other computers on the network by implementing the GetData (and related) operations. The response to a GetData operation is an XML file, in a format known as GDAS (Geographic Data Attribute Set).
2. JoinData: at some other node on the network, another GLS configured for the JoinData operation allows a computer to incorporate the contents of the XML file into a local spatial framework dataset. This local dataset would normally in turn be used to support mapping of this information.
In early versions of the GLS the specification was split into two separate specifications, one of which was known as the Geographic Data Access Service (GDAS). Subsequent revision integrated the two specifications. Note that GDAS (original) and GDAS (current) are not the same thing!
9GML vs GDAS
- In comparison to GML, GDAS provides the following specific benefits:
- It is a single logical encoding for attribute data.
- It is extremely light-weight.
- It is optimized for the efficient discovery of vector attributes.
- It includes attributes to support automated mapping, including titles, legends, and the classification of attributes.
- It includes attributes to address the presence of null values in the dataset to facilitate their exclusion from calculations and legends.
- It includes attributes to support the joining of tabular data to geometry in an N:1 or N:N fashion.
- It is easy to validate its content and convert it into HTML or other formats.
- It is easy to manipulate its content and enables the performance of calculations using XSLT.
- It is easy to generate directly from corporate database management technology, using languages such as XQuery.
10GLS Operations in more detail
GetData: the GetData operation returns an
entire GDAS file including attribute data and
its associated metadata. The related operations,
- DescribeFrameworks
- DescribeDatasets and
- DescribeData
return selectively larger portions of the
metadata for the geographic attributes that
can be served up by the GLS instance.
11GLS Operations in more detail
JoinData: the JoinData operation joins
attribute data in GDAS format to its
spatial framework and delivers references
to the joined output.
- The DescribeJoinAbilities operation returns a list of the spatial frameworks that are available to the service.
- The DescribeKey operation lists the spatial identifiers.
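The idea behind JoinData can be sketched in a few lines of Python. This is only an illustration of joining attribute rows to a spatial framework by a shared identifier, not the GLS implementation: the framework dictionary, the row layout and the identifiers (`OA1` and so on) are all hypothetical, and geometry is reduced to a label.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical framework: spatial identifier -> geometry (just a label here).
Framework = Dict[str, str]
# Hypothetical attribute rows, as they might be read from a GDAS file:
# (spatial identifier, value), with None standing in for a null value.
Row = Tuple[str, Optional[float]]


def join_data(framework: Framework, rows: List[Row]):
    """Attach attribute values to framework features that share a key."""
    joined = {}
    unmatched = []
    for key, value in rows:
        if key in framework:
            joined[key] = {"geometry": framework[key], "value": value}
        else:
            # The row refers to an identifier the framework does not hold.
            unmatched.append(key)
    return joined, unmatched


def mean_excluding_nulls(joined) -> float:
    """Average the joined values, leaving null values out of the calculation."""
    values = [f["value"] for f in joined.values() if f["value"] is not None]
    return sum(values) / len(values)


framework = {"OA1": "POLYGON A", "OA2": "POLYGON B", "OA3": "POLYGON C"}
rows = [("OA1", 120.0), ("OA2", None), ("OA3", 80.0), ("OA9", 5.0)]

joined, unmatched = join_data(framework, rows)
print(sorted(joined))                # ['OA1', 'OA2', 'OA3']
print(unmatched)                     # ['OA9']
print(mean_excluding_nulls(joined))  # 100.0
```

The list of framework keys plays the role that DescribeKey plays in the service: it tells a client which identifiers a join can match against.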
12GDAS in more detail
- The GDAS format is designed to support simple as well as rich and complicated attribute databases that may not always be easy to interpret.
- The metadata included in the encoding is designed to ensure that the user knows exactly what the content of the dataset is as well as which spatial framework it references, and has easy access to any associated documentation.
- GDAS is produced in response to a GLS GetData request
- The general structure of the GDAS XML encoding is as follows.
<GDAS>
  <Framework>
    ... spatial framework metadata
    <Dataset>
      ... attribute dataset metadata
      <Attribute>
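As a rough illustration of how a client might walk such a document, the sketch below parses a small GDAS-like sample with Python's standard library. Only the element names GDAS, Framework, Dataset and Attribute come from the outline above; the `Title` children, the `name` attribute and all sample values are assumptions made for the example and are not taken from the specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: only the four element names follow the slide;
# Title, the name attribute and the values are invented for illustration.
SAMPLE = """
<GDAS>
  <Framework>
    <Title>Hypothetical output areas</Title>
    <Dataset>
      <Title>Hypothetical population counts</Title>
      <Attribute name="population"/>
      <Attribute name="households"/>
    </Dataset>
  </Framework>
</GDAS>
"""


def summarise(xml_text):
    """Return (framework title, dataset title, attribute names) tuples."""
    root = ET.fromstring(xml_text.strip())
    summary = []
    for framework in root.findall("Framework"):
        for dataset in framework.findall("Dataset"):
            names = [a.get("name") for a in dataset.findall("Attribute")]
            summary.append(
                (framework.findtext("Title"), dataset.findtext("Title"), names)
            )
    return summary


for framework_title, dataset_title, names in summarise(SAMPLE):
    print(framework_title, "/", dataset_title, "/", ", ".join(names))
```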
13Value Added Services (1) Cartograms
A Cartogram Generation Service. Cartograms
represent map feature surfaces in such a way
as to make them proportional to a given
statistical variable. This representation
method mostly derives from "classical" maps
(i.e., maps representing ground topography) in
the sense that the transformation can only be
processed on an already given geometry.
Topographical polygon layers are thus mostly
used as a starting point for the production of
any cartogram. (ScapeToad)
In reality it looks more like this...
- Uses the excellent ScapeToad code at the backend to generate and output cartograms (chorogram.choros.ch/scapetoad)
- Uses the Gastner/Newman algorithm
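The proportionality that a cartogram aims for can be written down as simple arithmetic. The sketch below is not the Gastner/Newman algorithm and does not touch geometry at all; it only computes, for made-up polygons, the area each one would need so that area is proportional to the statistical variable while the total map area stays the same. Polygon names and numbers are hypothetical.

```python
def target_areas(areas, values):
    """Area each polygon needs so that area is proportional to its value.

    The total area of the map is preserved.
    """
    total_area = sum(areas.values())
    total_value = sum(values.values())
    return {key: total_area * values[key] / total_value for key in areas}


def scale_factors(areas, values):
    """Ratio of target area to current area (> 1 grows, < 1 shrinks)."""
    targets = target_areas(areas, values)
    return {key: targets[key] / areas[key] for key in areas}


# Hypothetical polygons: current surface area and a "mass" such as population.
areas = {"A": 10.0, "B": 30.0, "C": 60.0}
values = {"A": 500.0, "B": 300.0, "C": 200.0}

print(target_areas(areas, values))   # {'A': 50.0, 'B': 30.0, 'C': 20.0}
print(scale_factors(areas, values))
```

A real cartogram algorithm then has to deform the shared boundaries so that each polygon reaches its target area while neighbours stay adjacent, which is the hard part that ScapeToad handles.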
14Value Added Services (1) Cartograms
We have developed a simple Cartogram Generation
Service which takes a number of parameters, some
of which have default values. We've mimicked
those of the ScapeToad API.
layer=england_oa_2001
attribute=population
attrType=mass|density (a mass (e.g. a population or a wealth) is measured or estimated over the whole surface of each polygon; a density can be a mass/mass ratio or a mass/surface ratio)
url (example - must be encoded): http://diad.edina.ac.uk/service/joinedData?dataset=dataset_name
quality=50
grid=true
rows=100
Example request:
http://a.website.ac.uk/service/cartogram?layer=england_oa_2001&attribute=ks0080001&url=http://anothersite.ac.uk/test_1240418191235.zip
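Because the `url` parameter carries a URL of its own, it has to be percent-encoded before it is placed in the query string. A minimal sketch of building such a request with Python's standard library follows. It assumes the parameters on this slide are plain name=value query pairs, uses the slide's placeholder host names, and the helper name `cartogram_request` is hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def cartogram_request(base_url, layer, attribute, data_url, **options):
    """Build a cartogram request URL, percent-encoding every value.

    Assumption: the service accepts ordinary name=value query parameters.
    """
    params = {"layer": layer, "attribute": attribute, "url": data_url}
    params.update(options)
    return base_url + "?" + urlencode(params)


data_url = "http://diad.edina.ac.uk/service/joinedData?dataset=dataset_name"
request = cartogram_request(
    "http://a.website.ac.uk/service/cartogram",  # placeholder host from the slide
    layer="england_oa_2001",
    attribute="ks0080001",
    data_url=data_url,
    quality=50,
    grid="true",
    rows=100,
)
print(request)

# Decoding the query string recovers the original nested URL unchanged.
decoded = parse_qs(urlsplit(request).query)
print(decoded["url"][0] == data_url)  # True
```

Without the encoding step, the `?` and `=` inside the nested URL would be read as part of the outer query string and the service would see the wrong parameters.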
15Value Added Services (1) Cartograms
Worldmapper example: Age of Death (http://www.worldmapper.org/)
DIaD generated: Deprivation (ONS) Income scores, Swindon
16Value Added Services (2) BitTorrent
- A peer-to-peer file sharing protocol used for distributing large amounts of data.
- BitTorrent is one of the most common protocols for transferring large files, and by some estimates it accounts for about 35% of all traffic on the entire Internet.
- The protocol works initially when a file provider makes his file (or group of files) available to the network. This is called a seed and allows others, named peers, to connect and download the file. Each peer that downloads a part of the data makes it available to other peers to download. After the file is successfully downloaded by a peer, many continue to make the data available, becoming additional seeds.
- This distributed nature of BitTorrent leads to a viral spreading of a file throughout peers. As more seeds get added, the likelihood of a successful connection increases exponentially. Relative to standard Internet hosting, this provides a significant reduction in the original distributor's hardware and bandwidth resource costs.
- Provides redundancy against system problems and reduces dependence on the original distributor.
17Value Added Services (2) BitTorrent
- A BitTorrent creation service will be added to the linked outputs and a tracker established to allow geolinked results to be downloaded via a P2P client, e.g. uTorrent
- Rationale: increasingly researchers/students want access to resources from home
- Home machines tend to have lower bandwidth than that directly available from the SuperJANET backbone
- So the BitTorrent approach means users can 'share the load' of large files
- Note that boundary data and statistics data can have quite large file sizes (GB vs MB)
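To give a feel for what a torrent creation step involves, here is a minimal sketch of building single-file torrent metainfo in Python. It follows the standard BitTorrent layout (a bencoded dictionary holding the tracker URL and SHA-1 hashes of fixed-size pieces), but it is not the DIaD service: the tracker URL, file name and data are hypothetical, and the whole file is held in memory for simplicity.

```python
import hashlib


def bencode(obj) -> bytes:
    """Encode ints, bytes/str, lists and dicts in BitTorrent's bencoding."""
    if isinstance(obj, int):
        return b"i" + str(obj).encode("ascii") + b"e"
    if isinstance(obj, str):
        obj = obj.encode("utf-8")
    if isinstance(obj, bytes):
        return str(len(obj)).encode("ascii") + b":" + obj
    if isinstance(obj, list):
        return b"l" + b"".join(bencode(item) for item in obj) + b"e"
    if isinstance(obj, dict):
        # Dictionary keys must be byte strings, emitted in sorted order.
        items = [(k.encode("utf-8") if isinstance(k, str) else k, v)
                 for k, v in obj.items()]
        items.sort(key=lambda pair: pair[0])
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError("cannot bencode %r" % type(obj))


def piece_hashes(data: bytes, piece_length: int) -> bytes:
    """Concatenated 20-byte SHA-1 digests, one per piece of the file."""
    return b"".join(
        hashlib.sha1(data[i:i + piece_length]).digest()
        for i in range(0, len(data), piece_length)
    )


def make_torrent(name: str, data: bytes, announce: str,
                 piece_length: int = 262144) -> bytes:
    """Return the bytes of a single-file .torrent for the given data."""
    info = {
        "name": name,
        "length": len(data),
        "piece length": piece_length,
        "pieces": piece_hashes(data, piece_length),
    }
    return bencode({"announce": announce, "info": info})


# Hypothetical linked output and tracker, for illustration only.
data = b"x" * 600_000
torrent = make_torrent("linked_output.zip", data,
                       "http://tracker.example.ac.uk/announce")
print(len(piece_hashes(data, 262144)) // 20, "pieces")  # 3 pieces
```

Peers use the piece hashes to verify each part they receive, which is what lets a downloader safely fetch different parts of a large boundary file from different machines.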
18Whither open source?
- Our demo client uses a front and back-end stack of OSS
- OpenLayers
- PostGIS
- GeoServer
- OGR
- ScapeToad
- Our own code for the GLS is (will be) open source
19General Observations
- Open standards (and OSS) have a definite role but...
- They are not an end in themselves
- They are not always as mature (or static) as you might wish
- Things evolve, often in short time periods
- Users (!)
- Interoperability (the holy grail) is possible but there are significant barriers
- AA issues (UKAMF web services - GeoXACML?)
- Scalability (Cloud/Grid ??)
- Evolving delivery paradigms e.g. mobile
- User expectations vs resourcing constraints
20DIaD in progress..
21http://devel.edina.ac.uk:8080/diad/diad.html