Title: Noriel Christopher C. Tiglao, Dr. Eng
1Spatial Data Representation and Analysis
Module 3
- Noriel Christopher C. Tiglao, Dr. Eng
- 24 January to 4 February 2005
- Statistical Research and Training Center (SRTC)
- Quezon City, Metro Manila
2Presentation Outline
- Introduction
- Spatial Data and Spatial Relationships
- Sampling Reality
- Scales of Measurement
- Data Sources and Errors
- Data Abstraction
- Spatial Data Structures
3Introduction
- The world is infinitely complex
- The contents of a spatial database represent a
particular view of the world
- The user sees the real world through the medium of
the database
- The measurements and samples contained in the
database must present as complete and accurate a
view of the world as possible
4Introduction (contd.)
- The contents of the database must be relevant in
terms of
- themes and characteristics captured
- the time period covered
- the study area
5Representing Reality
- A database consists of digital representations of
discrete objects
- The features shown on a map, e.g. lakes,
benchmarks, contours, can be thought of as
discrete objects
- The contents of a map can be captured in a
database by turning map features into database
objects
6Representing Reality (contd.)
- Many of the features shown on a map are
fictitious and do not exist in the real world
- contours do not really exist, but houses and
lakes are real objects
- The contents of a spatial database include
- digital versions of real objects, e.g. houses
- digital versions of artificial map features, e.g.
contours
- artificial objects created for the purposes of
the database, e.g. pixels
7Data
- Data are facts
- Some facts are more important to us than others.
Some facts are important enough to warrant
keeping track of them in a formal, organized way
- "Data" is a broad concept that can include things
such as pictures (binary images), programs, and
rules
- Informally, data are the things you want to store
in a database
8Spatial vs. Non-spatial Data
- Spatial data includes location, shape, size, and
orientation
- Spatial data includes spatial relationships
- Non-spatial data (also called attribute or
characteristic data) is that information which is
independent of all geometric considerations
9Spatial vs. Non-spatial Data (contd.)
- It is possible to ignore the distinction between
spatial and non-spatial data. However, there are
fundamental differences between them
- spatial data are generally multi-dimensional and
autocorrelated
- non-spatial data are generally one-dimensional
and independent
10Spatial vs. Non-spatial Data (contd.)
- These distinctions put spatial and non-spatial
data into different philosophical camps with
far-reaching implications for conceptual,
processing, and storage issues
- For example, sorting is perhaps the most common
and important non-spatial data processing
function
- It is not obvious how to even sort locational
data such that all points end up near their
nearest neighbors
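The sorting problem above can be made concrete with a space-filling curve. The sketch below uses a Morton (Z-order) key, which is an assumption of this example and not a method named in the slides. It interleaves the bits of integer x and y coordinates so that a one-dimensional sort keeps many, though never all, spatial neighbours close together.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Morton (Z-order) key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)        # x bits go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)    # y bits go to odd positions
    return key


def morton_sort(points):
    """Sort non-negative integer grid points by their Morton key."""
    return sorted(points, key=lambda p: interleave_bits(p[0], p[1]))


# Illustrative grid coordinates only.
points = [(3, 3), (0, 0), (1, 0), (0, 1), (2, 2), (1, 1)]
ordered = morton_sort(points)
```

No one-dimensional ordering can keep every pair of neighbouring points adjacent, which is the difficulty the slide describes. A curve such as this one only reduces the problem.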
11Spatial Relationships
- Describe the association among different features
in space
- are visually obvious when data are presented in
graphical form
- however, it is difficult to build spatial
relationships into the information organization
and data structure of a database
12Spatial Relationships (contd.)
- Difficulty in capturing spatial relationships in
a database
- there are numerous types of spatial relationships
possible among features
- recording spatial relationships explicitly
demands considerable storage space
- computing spatial relationships on the fly slows
down data processing, particularly if relationship
information is required frequently
13Point-Line-Area Relationship Matrix
14Spatial Relationships (contd.)
- Topological
- describes the properties of adjacency, connectivity
and containment of contiguous features
- Proximal
- describes the property of closeness of
non-contiguous features
16Spatial Relationships (contd.)
- Spatial relationships are very important in
geographical data processing and modeling
- the objective of information organization and
data structure is to find a way to handle
spatial relationships with the minimum storage
and computation requirements
17Spatial Data
- Phenomena in the real world can be observed in
three modes: spatial, temporal and thematic
- the spatial mode deals with variation from place
to place
- the temporal mode deals with variation from time
to time (one slice to another)
- the thematic mode deals with variation from one
characteristic to another (one layer to another)
18Spatial Data (contd.)
- All measurable or describable properties of the
world can be considered to fall into one of these
modes: place, time and theme
- An exhaustive description of all three modes is
not possible
19Spatial Data (contd.)
- When observing real-world phenomena we usually
hold one mode fixed, vary one in a controlled
manner, and measure the third (Sinton, 1978)
- e.g. using a census of population we could fix a
time such as 1990, control for location using
census tracts and measure a theme such as the
percentage of persons owning automobiles
20Spatial Data (contd.)
- Holding geography fixed and varying time gives
longitudinal data
- Holding time fixed and varying geography gives
cross-sectional data
- The modes of information stored in a database
influence the types of problem solving that can
be accomplished
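The fix, control and measure idea can be sketched with a small table of observations. Everything below is hypothetical: the tract names, years, theme name and values are made up for illustration and are not census figures.

```python
# Each observation records all three modes: place, time, theme (plus a value).
# Values are illustrative only.
observations = [
    ("tract_A", 1990, "pct_car_owners", 40.0),
    ("tract_B", 1990, "pct_car_owners", 55.0),
    ("tract_A", 2000, "pct_car_owners", 48.0),
    ("tract_B", 2000, "pct_car_owners", 60.0),
]


def cross_section(obs, year, theme):
    """Hold time fixed, vary geography: one value per place."""
    return {place: value for place, t, th, value in obs
            if t == year and th == theme}


def longitudinal(obs, place, theme):
    """Hold geography fixed, vary time: one value per year."""
    return {t: value for p, t, th, value in obs
            if p == place and th == theme}


by_place_1990 = cross_section(observations, 1990, "pct_car_owners")
tract_a_series = longitudinal(observations, "tract_A", "pct_car_owners")
```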
21Location
- The spatial mode of information is generally
called location
22Attributes
- Attributes capture the thematic mode by defining
different characteristics of objects
- A table showing the attributes of objects is
called an attribute table
- each object corresponds to a row of the table
- each characteristic or theme corresponds to a
column of the table
- thus the table shows the thematic and some of the
spatial modes
23Time
- The temporal mode can be captured in several ways
- by specifying the interval of time over which an
object exists
- by capturing information at certain points in
time
- by specifying the rates of movement of objects
24Time (contd.)
- Depending on how the temporal mode is captured,
it may be included in a single attribute table,
or be represented by a series of attribute tables
on the same objects through time
25Sampling Reality
- Numerical values may be defined with respect to
nominal, ordinal, interval, or ratio scales of
measurement
- It is important to recognize the scales of
measurement used in GIS data, as this determines
the kinds of mathematical operations that can be
performed on the data
26Sampling Reality (contd.)
27Marathon Example
28Sampling Reality (contd.)
- Distinctions, though important, are not always
clearly defined
- Is elevation interval or ratio? If the local base
level is 750 feet, is a mountain at 2000 feet
twice as high as one at 1000 feet when viewed
from the valley?
- Many types of geographical data used in GIS
applications are nominal or ordinal
- Values establish the order of classes, or their
distinct identity, but rarely intervals or ratios
29Sampling Reality (contd.)
- Thus you cannot
- multiply soil type 2 by soil type 3 and get soil
type 6
- divide urban area by the rank of a city to get a
meaningful number
- subtract suitability class 1 from suitability
class 4 to get 3 of anything
- However, you can
- divide population by area (both ratio scales) and
get population density
- subtract elevation at point a from elevation at
point b and get the difference in elevation
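One way to enforce these rules in software is to tag each variable with its scale of measurement and check an operation against the weakest scale involved. The sketch below is a minimal illustration. The names `Scale`, `MIN_SCALE` and `is_allowed` are hypothetical, and the mapping of operations to minimum scales follows the usual textbook convention, not anything stated in the slides.

```python
from enum import IntEnum


class Scale(IntEnum):
    """Scales of measurement, ordered from weakest to strongest."""
    NOMINAL = 1
    ORDINAL = 2
    INTERVAL = 3
    RATIO = 4


# Minimum scale every operand must have for the result to be meaningful.
MIN_SCALE = {
    "equals": Scale.NOMINAL,     # same class or not
    "compare": Scale.ORDINAL,    # greater than / less than
    "subtract": Scale.INTERVAL,  # differences are meaningful
    "multiply": Scale.RATIO,     # products and ratios need a true zero
    "divide": Scale.RATIO,
}


def is_allowed(operation: str, *scales: Scale) -> bool:
    """Return True if `operation` is meaningful for operands on `scales`."""
    return min(scales) >= MIN_SCALE[operation]


# Population / area: both ratio scales, so density is meaningful.
density_ok = is_allowed("divide", Scale.RATIO, Scale.RATIO)
# Soil type 2 * soil type 3: nominal codes, so the product is meaningless.
soil_product_ok = is_allowed("multiply", Scale.NOMINAL, Scale.NOMINAL)
```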
30Multiple Representations
- A data model is essential to represent
geographical data in a digital database
- There are many different data models
- The same phenomena may be represented in
different ways, at different scales and with
different levels of accuracy
- Thus there may be multiple representations of the
same geographical phenomena
31Multiple Representations (contd.)
- It is difficult to convert from one
representation to another
- e.g. from a small scale (1:250,000) to a large
scale (1:10,000)
- Thus it is common to find databases with multiple
representations of the same phenomenon
- this is wasteful, but techniques to avoid it are
poorly developed
32Primary Data Sources
- Some of the data in a spatial database may have
been measured directly
- e.g. by field sampling or remote sensing
- The density of sampling determines the resolution
of the data
- e.g. samples taken every hour will capture
hour-to-hour variation, but miss shorter-term
variation
- e.g. samples taken every 1 km will miss any
variation at resolutions less than 1 km
33Primary Data Sources (contd.)
- A sample is designed to capture the variation
present in a larger universe
- e.g. a sample of places should capture the
variation present at all possible places
- e.g. a sample of times will be designed to
capture variation at all possible times
34Sampling Approaches
- In a random sample, every place or time is
equally likely to be chosen
- Systematic samples are chosen according to a
rule, e.g. every 1 km, but the rule is expected
to create no bias in the results of analysis,
i.e. the results would have been similar if a
truly random sample had been taken
35Sampling Approaches (contd.)
- In a stratified sample, the researcher knows for
some reason that the universe contains
significantly different sub-populations, and
samples within each sub-population in order to
achieve adequate representation of each
- e.g. we may know that the topography is more
rugged in one part of the area, and sample more
densely there to ensure adequate representation
- if a representative sample of the entire universe
is required, then the subsamples in each
sub-population will have to be weighted
appropriately
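The three sampling approaches, and the weighting step for stratified samples, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the universe is a simple list of candidate places or times, and each stratum is weighted by its share of the universe, which is one common choice.

```python
import random


def random_sample(universe, n, seed=0):
    """Random sample: every element is equally likely to be chosen."""
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    return rng.sample(universe, n)


def systematic_sample(universe, step, start=0):
    """Systematic sample: take every `step`-th element, e.g. every 1 km."""
    return universe[start::step]


def stratified_mean(strata):
    """Weighted mean over strata.

    `strata` is a list of (stratum_size, sampled_values) pairs. Each
    stratum mean is weighted by the stratum's share of the universe, so
    a densely sampled stratum does not dominate the overall estimate.
    """
    total = sum(size for size, _ in strata)
    return sum(size * (sum(values) / len(values))
               for size, values in strata) / total


# Illustrative numbers only: a flat area (90 units, 2 samples) and a
# rugged area (10 units, 4 samples taken more densely).
estimate = stratified_mean([(90, [10.0, 10.0]),
                            (10, [50.0, 50.0, 50.0, 50.0])])
```

Pooling the six sampled values without weights would give a mean of about 36.7, which overstates the contribution of the small, densely sampled rugged area. The weighted estimate corrects for that.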
36Secondary Data Sources
- Some data may have been obtained from existing
maps, tables, or other databases
- To be useful, it is important to obtain
information in addition to the data themselves
- information on the procedures used to collect and
compile the data
- information on coding schemes, accuracy of
instruments
37Secondary Data Sources (contd.)
- Unfortunately such information is often not
available
- a user of a spatial database may not know how the
data were captured and processed prior to input
- this often leads to misinterpretation and false
expectations about accuracy
38Data Standards
- Standards may be set to assure uniformity
- within a single data set
- across data sets
- e.g. uniform information about timber types
throughout the database allows better fire
fighting methods to be used, or better control of
insect infestations
- Data capture should be undertaken in standardized
ways that will assure the widest possible use of
the information
39Sharing Data
- It is not uncommon for as many as three agencies
to create databases with, ostensibly, the same
information
- e.g. a planning agency may map land use, including
a forested class
- e.g. the state department of forestry also maps
forests
- e.g. the wildlife division of the department of
conservation maps habitat, which includes fields
and forest
40Sharing Data (contd.)
- Each may digitize their forest class on
different GIS, using different protocols,
and with different definitions for the classes of
forest cover
- this is a waste of time and money
- Sharing information gives it added value
- Sharing basic formats with other information
providers, such as a department of
transportation, might make marketing the database
more profitable
41Errors and Accuracy
- There is a nearly universal tendency to lose
sight of errors once the data are in digital form
- Errors
- are implanted in databases because of errors in
the original sources (source errors)
- are added during data capture and storage
(processing errors)
- occur when data are extracted from the computer
- arise when the various layers of data are
combined in an analytical exercise
42Errors in Sources
- Are extremely common in non-mapped source data,
such as locations of wells, or lot descriptions
- Can be caused by doing inventory work from aerial
photography and misinterpreting images
- Often occur because base maps are relied on too
heavily
43Classification Errors
- Are common when tabular data are rendered in map
form
- Simple typing errors may be invisible until
presented graphically
- More complex classification errors may be due to
the sampling strategies that produced the
original data
44Data Capture Errors
- Manual data input induces another set of errors
- Eye-hand coordination varies from operator to
operator and from time to time
- Data input is a tedious task: it is difficult to
maintain quality over long periods of time
45Accuracy Standards
- Many agencies may not have established accuracy
standards for geographical data
- these are more often concerned with accuracy of
locations of objects than with accuracy of
attributes
- Higher accuracy requires better source materials
- is the added cost justified by the objectives of
the study?
- Accuracy standards should be determined by
considering both the value of information and the
cost of collection
46Data Abstraction
- Capturing the essential pieces of information to
describe the spatial phenomenon
- Based on a conceptual model of reality
- Expressed in data models
- Realized by building up data structures (i.e.
internal representation of spatial data) using
database models
47Reality, Conceptual Model and Database
Database (Cyber world)
Reality
Conceptual model
48Data Abstraction Example
Locating objects
Recording attribute info.
Data (1.0, 20.3, 9.0, 12.8, 15.0, 10000.00)
49Data Abstraction Example
Data (1.0, 20.3, 9.0, 12.8, 15.0, 10000.00)
What do these data mean?
Abstraction rule (conceptual model for describing a house for rent)
- (1.0, 20.3) and (9.0, 12.8): coordinate pairs
- Z = 15.0
- Monthly rent: PhP 10,000.00
The conceptual model is crucial information for data users
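The point of the example is that the six numbers are meaningless without the abstraction rule. A minimal sketch of that rule as a typed record is shown below. The class name and field names are hypothetical, and the field order simply follows the order on the slide. The slide does not say what the two coordinate pairs or the Z value physically denote, so they are kept as plain coordinates here.

```python
from dataclasses import dataclass


@dataclass
class HouseForRent:
    """Conceptual model for one 'house for rent' record (assumed layout)."""
    x1: float                # first coordinate pair
    y1: float
    x2: float                # second coordinate pair
    y2: float
    z: float                 # Z value from the slide; meaning not stated there
    monthly_rent_php: float  # monthly rent in Philippine pesos

    @classmethod
    def from_record(cls, record):
        """Apply the abstraction rule to a raw six-value tuple."""
        if len(record) != 6:
            raise ValueError("expected six values")
        return cls(*record)


raw = (1.0, 20.3, 9.0, 12.8, 15.0, 10000.00)
house = HouseForRent.from_record(raw)
```

Without the class definition a user sees only `raw`. With it, `house.monthly_rent_php` is self-describing.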
50Levels of Data Abstraction
51Data Model vs. Database Model
- Data Model
- Vector methods (feature-based)
- Raster methods (field-based)
- Database Model
- Software implementation of data models
- Network, hierarchical and object-oriented
databases
52Vector Data Model
- Method of representing geographic features by the
basic graphical elements
- Points
- Lines (arcs)
- Polygons (areas)
- They can also be used to construct complex
features
53Basic Graphical Elements
54Vector Data Model (contd.)
- Related vector data are always organized by
themes, which are also referred to as layers or
coverages
- examples of themes: geodetic control, base map,
soil, vegetation cover, land use, transportation,
drainage and hydrology, political boundaries,
land parcel and others
55Vector Data Model (contd.)
- For themes covering a very large geographic area,
the data are always divided into tiles so that
they can be managed more easily
- a tile is the digital equivalent of an individual
map in a map series
- a tile is uniquely identified by a file name
56Vector Data Model (contd.)
- A collection of themes of vector data covering
the same geographic area and serving the common
needs of a multitude of users constitutes the
spatial component of a geographical database
- Graphical data captured by imaging devices in
remote sensing and digital cartography (such as
multi-spectral scanners, digital cameras and
image scanners) are made up of a matrix of
picture elements (pixels) of very fine resolution
57Raster Data Model
- Method of representing geographic features by
pixels
- A raster pixel is usually a square grid cell, but
there are several variants such as
triangles and hexagons
58Raster Data Model (contd.)
- A raster pixel represents the generalized
characteristics of an area of specific size on or
near the surface of the Earth
- the actual ground size depicted by a pixel is
dependent on the resolution of the data, which
may range from smaller than a square meter to
several square kilometers
- Raster data are organized by themes, which are
also referred to as layers
59Raster Data Model (contd.)
- Raster data covering a large geographic area are
organized by scenes (for remote sensing images)
- The raster method is based on the concept that
geographic features are represented as surfaces,
regions or segments
60Vector Data Structure
- Spaghetti
- a direct line-for-line, unstructured translation
of the paper map, with very limited practical use
- it is usually an interim data structure for map
digitizing
- Hierarchical
- a vector data structure developed to facilitate
data retrieval by separately storing points,
lines and areas in a logically hierarchical manner
61Spaghetti Data Model and Data Structure
62Hierarchical Data Model and Data Structure
63Vector Data Structure (contd.)
- Topological
- a vector data structure that captures spatial
relationships by explicitly storing adjacency
information
- the basic logical feature for line and area
coverage is a straight line segment
- each individual line segment is defined by the
coordinates of its end points, called nodes
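A topological structure of this kind can be sketched with a node table and an arc table. The layout below follows the common arc-node convention, which is an assumption of this example: each arc stores its from-node, to-node, and the polygons on its left and right. The slides do not give the exact table layout, and the node coordinates and polygon names are made up.

```python
from collections import defaultdict

OUTSIDE = None  # marker for "no polygon on this side"

# Node id -> (x, y). A unit square split along one diagonal.
nodes = {1: (0.0, 0.0), 2: (1.0, 0.0), 3: (1.0, 1.0), 4: (0.0, 1.0)}

# Arc id -> (from_node, to_node, left_polygon, right_polygon).
arcs = {
    "a": (1, 2, "A", OUTSIDE),
    "b": (2, 3, "A", OUTSIDE),
    "c": (3, 1, "A", "B"),      # the shared diagonal
    "d": (3, 4, "B", OUTSIDE),
    "e": (4, 1, "B", OUTSIDE),
}


def polygon_adjacency(arc_table):
    """Adjacency: two polygons are neighbours if they share an arc."""
    adjacency = defaultdict(set)
    for _, _, left, right in arc_table.values():
        if left is not OUTSIDE and right is not OUTSIDE:
            adjacency[left].add(right)
            adjacency[right].add(left)
    return dict(adjacency)


def arcs_at_node(arc_table, node):
    """Connectivity: all arcs that start or end at `node`."""
    return sorted(arc_id for arc_id, (start, end, _, _) in arc_table.items()
                  if node in (start, end))


neighbours = polygon_adjacency(arcs)
meeting_at_3 = arcs_at_node(arcs, 3)
```

Because adjacency is stored with each arc, the neighbour query is a table scan and needs no geometric computation. That is the trade of storage for speed the earlier slides describe.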
64Topological Data Model and Data Structure
65Raster Data Structure
- Space is subdivided into regular grids of square
grid cells or other forms of polygonal meshes
known as picture elements (pixels)
- the location of each cell is defined by its row
and column numbers
- the area that each cell represents defines the
spatial resolution of the data
- the position of a geographic feature is only
recorded to the nearest pixel
66Raster Data Structure (contd.)
- the value stored for each cell indicates the
type of object, phenomenon or condition that
is found in that particular location
- different types of values can be coded: integers,
real numbers and alphabetic characters
- integer values often act as code numbers, which
are referenced to names in an associated table
(called the look-up table) or legend
- different attributes at the same cell location
are stored as separate themes or layers
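The raster structure described on the last two slides can be sketched as a grid of integer codes plus a look-up table. The class codes, class names, cell size and origin below are illustrative assumptions. The origin is taken to be the upper-left corner with rows counted downward, which is a common but not universal convention.

```python
# One theme (layer): integer class codes, addressed by [row][column].
land_cover = [
    [1, 1, 2],
    [1, 3, 2],
    [3, 3, 2],
]

# Look-up table: code number -> class name (illustrative classes).
lookup = {1: "forest", 2: "water", 3: "urban"}

CELL_SIZE = 10.0                 # ground units per cell side (assumed)
ORIGIN_X, ORIGIN_Y = 0.0, 30.0   # upper-left corner of the grid (assumed)


def cell_for(x, y):
    """Return the (row, column) of the cell containing ground point (x, y)."""
    col = int((x - ORIGIN_X) // CELL_SIZE)
    row = int((ORIGIN_Y - y) // CELL_SIZE)
    if not (0 <= row < len(land_cover) and 0 <= col < len(land_cover[0])):
        raise IndexError("point lies outside the raster")
    return row, col


def class_at(x, y):
    """Position is only known to the nearest cell; return that cell's class."""
    row, col = cell_for(x, y)
    return lookup[land_cover[row][col]]
```

A second attribute for the same area, such as elevation, would be stored as another grid with the same dimensions. It would not be stored as extra values inside these cells.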
67Characteristics of Raster Data Structure
68Raster Data Structure (contd.)
- There are several variants to the regular grid
raster data structure, including
- irregular tessellation (e.g. triangulated
irregular network (TIN))
- hierarchical tessellation (e.g. quad tree)
- scan-line
69Representing Fields
- There are many ways of representing fields
- not all are implemented in GIS
- different terminologies exist in different
disciplines
- Six major representations
- Regular cells
- Rectangular grid of points
- Irregularly spaced points
- Digitized contours
- Polygons
- Triangulated irregular networks (TINs)
70Regular Cells
- Value in each cell is an average, total, or some
other aggregate property of the field within the
cell
- the representation defines a value everywhere, so
is complete
- however, all within-cell variation is lost
- if necessary, it must be reconstructed by some
method of intelligent guesswork
- e.g. remote sensing data and other kinds of
digital imagery
72Rectangular grid of points
- e.g. measurements of land surface elevation in a
digital elevation model (DEM)
- spacing of measurements is critical to accuracy
of representation
- all variation between sample points is lost
- elevations at other points must be estimated by
some method of intelligent guesswork (the
representation is incomplete)
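One common form of the "intelligent guesswork" mentioned above is bilinear interpolation between the four surrounding grid points. The slides do not name a method, so this is only one possible choice. The function name, the grid layout (`dem[row][col]` at ground position `(col * spacing, row * spacing)`) and the elevations are assumptions.

```python
def bilinear(dem, spacing, x, y):
    """Estimate elevation at (x, y) from a regular grid of point samples.

    dem[row][col] is the measured elevation at (col * spacing, row * spacing).
    The point must lie within the extent of the grid.
    """
    # Index of the lower-left sample of the enclosing grid square,
    # clamped so points on the far edge still use a valid square.
    col = min(int(x // spacing), len(dem[0]) - 2)
    row = min(int(y // spacing), len(dem) - 2)
    tx = x / spacing - col   # fractional position within the square, 0..1
    ty = y / spacing - row
    z00, z01 = dem[row][col], dem[row][col + 1]
    z10, z11 = dem[row + 1][col], dem[row + 1][col + 1]
    near = z00 * (1 - tx) + z01 * tx   # interpolate along x on one side
    far = z10 * (1 - tx) + z11 * tx    # and on the opposite side
    return near * (1 - ty) + far * ty  # then along y between the two


# Illustrative elevations at four grid points spaced 10 units apart.
dem = [[100.0, 110.0],
       [120.0, 130.0]]
centre = bilinear(dem, 10.0, 5.0, 5.0)
```

Any peak or pit narrower than the sample spacing is invisible to this estimate, which is why the spacing of measurements is critical to accuracy.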
74Irregularly spaced points
- The field's value is defined at a set of sample
points scattered in the frame
- values of the field at other points must be
interpolated; the representation is incomplete
- e.g. weather data, available at scattered weather
stations
- accuracy depends on the density of points
- it is not clear what measure best defines
accuracy: density per unit area, minimum
distance between sample points, or maximum distance
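Interpolation from scattered points can be sketched with inverse distance weighting (IDW). IDW is an assumption of this example, chosen because it is simple; the slides do not prescribe an interpolation method. The station coordinates and values are made up.

```python
import math


def idw(samples, x, y, power=2.0):
    """Inverse-distance-weighted estimate of a field at (x, y).

    `samples` is a list of (sx, sy, value) tuples, e.g. weather stations.
    Nearer samples receive larger weights (1 / distance ** power).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for sx, sy, value in samples:
        distance = math.hypot(x - sx, y - sy)
        if distance == 0.0:
            return value          # exactly at a sample point
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


# Illustrative stations: (x, y, temperature).
stations = [(0.0, 0.0, 10.0), (10.0, 0.0, 20.0)]
midway = idw(stations, 5.0, 0.0)
```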
76Digitized contours
- The field is represented as a set of isolines,
each connecting points of constant value
- representation is incomplete
- The scale of measurement of the variable must be
at least ordinal
- isolines cannot be defined for nominal data
- Each isoline is represented as a polyline
- e.g. data obtained from topographic maps
- Accuracy depends on
- the number of contoured values, or the contour
interval
- the density of polyline points
78Polygons
- The frame is partitioned into irregular areas
(volumes for 3 or more dimensions)
- value in each area is an average, total, or some
other aggregate property of the field within the
area
- the representation is complete
- all variation within areas is lost
- e.g. data obtained from maps of vegetation cover
class, soil type
79Polygons (contd.)
- The boundaries of areas are continuously curved
lines
- represented digitally as polylines: an ordered
sequence of points connected by straight lines
- the denser the points, the more accurate the
polyline as a representation of a continuous
curve
- accuracy depends both on the size of polygons and
on the density of polyline points
- it is not clear what measure of polygon size
(average, minimum) best defines accuracy
80Polygons (contd.)
- Every point in the frame lies in exactly one
polygon
- except for points on the boundaries
- the polygons cannot overlap and must exhaust the
frame
- they are said to tessellate the space; they form
an irregular tessellation
82Triangulated irregular networks (TINs)
- the frame is covered with a mesh of irregular
triangles
- every point lies in exactly one triangle, or on a
triangle edge
- the value of the field is known at every triangle
vertex
- within triangles and along edges it is assumed to
vary linearly
- the representation is complete
- contours drawn across triangles will therefore
always be straight and parallel
- across triangle edges there will be breaks of
slope, but not cliffs
- contours will kink at edges
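Linear variation within a triangle can be computed with barycentric weights: the value at any interior point is a weighted average of the three vertex values. The sketch below is a minimal illustration. The function name and the vertex coordinates and elevations are made up.

```python
def tin_interpolate(triangle, x, y):
    """Linearly interpolate the field inside one TIN triangle.

    `triangle` is three (x, y, z) vertices. The result is the height of
    the plane through the vertices at (x, y). Raises ValueError if the
    point lies outside the triangle.
    """
    (x1, y1, z1), (x2, y2, z2), (x3, y3, z3) = triangle
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    # Barycentric weights of the three vertices; they sum to 1.
    w1 = ((y2 - y3) * (x - x3) + (x3 - x2) * (y - y3)) / det
    w2 = ((y3 - y1) * (x - x3) + (x1 - x3) * (y - y3)) / det
    w3 = 1.0 - w1 - w2
    if min(w1, w2, w3) < -1e-12:
        raise ValueError("point lies outside the triangle")
    return w1 * z1 + w2 * z2 + w3 * z3


# Illustrative triangle: elevations 100, 120 and 140 at its vertices.
tri = [(0.0, 0.0, 100.0), (10.0, 0.0, 120.0), (0.0, 10.0, 140.0)]
inside = tin_interpolate(tri, 2.0, 2.0)
on_edge = tin_interpolate(tri, 5.0, 5.0)
```

Because the surface inside each triangle is a plane, the set of points at any given elevation is a straight line, and lines for different elevations are parallel. This matches the statement above about contours.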
83Triangulated irregular networks (TINs)
- the scale of measurement of the variable must be
at least interval
- variation within triangles cannot be defined for
nominal or ordinal variables
84Triangulated irregular networks (TINs)
- Accuracy depends on
- how carefully the vertices were located on the
surface
- how well the planes defined within each triangle
fit the actual surface
- the sizes of triangles
- but it is not clear what property of triangle
size best defines accuracy: average, smallest, or
largest
86Vector and Raster Data Integration
- Recent advances in computer technologies allow
these two types of data to be used in the same
applications
- computers are now capable of converting data from
the vector format to the raster format
(rasterization) and vice versa (vectorization)
- computers are now able to display vector and
raster data simultaneously
- vector and raster data are largely seen as
complementary to, rather than competing against,
one another in geographic data processing
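Rasterization can be sketched by testing the centre of every grid cell against a vector polygon. The version below uses the even-odd ray-casting test, which is one standard approach and an assumption of this example. Row 0 is taken to be the row nearest y = 0, and the polygon is made up.

```python
def point_in_polygon(x, y, polygon):
    """Even-odd ray casting: count edge crossings of a ray cast toward +x."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):   # this edge spans the ray's y value
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def rasterize(polygon, rows, cols, cell_size=1.0):
    """Return a rows x cols grid: 1 where the cell centre is in the polygon."""
    return [
        [1 if point_in_polygon((c + 0.5) * cell_size,
                               (r + 0.5) * cell_size, polygon) else 0
         for c in range(cols)]
        for r in range(rows)
    ]


# Illustrative square polygon given as a list of (x, y) vertices.
square = [(1.0, 1.0), (3.0, 1.0), (3.0, 3.0), (1.0, 3.0)]
grid = rasterize(square, rows=4, cols=4)
```

The polygon's exact boundary is lost in the grid and survives only to the nearest cell. This is the positional limit of raster data noted earlier.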
87Georelational Data Structure
- Was developed to handle geographic data
- It allows the association between spatial
(graphical) and non-spatial (descriptive) data
- It is the data structure used by many
vector-based GIS software packages
- Both spatial and non-spatial data are stored in
relational tables
88Georelational Data Structure (contd.)
- Point, line and polygon data are stored in
separate feature attribute tables (FAT)
- in the FAT, each entity is assigned a unique
feature identifier (FID)
- topological information is explicitly stored by
employing a method similar to the topological
data structure described above
- non-spatial data are stored in relational tables
89Feature Attribute Table (FAT)
90Georelational Data Structure (contd.)
- Entities in the spatial and non-spatial
relational tables are linked by the common FIDs
of entities
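The FID link between the spatial and non-spatial tables amounts to a relational join. A minimal sketch with plain Python structures follows. The table names, column names and values are hypothetical and stand in for whatever a particular GIS package stores.

```python
# Feature attribute table (FAT) for polygons: one row per feature.
parcel_fat = [
    {"fid": 101, "area": 250.0, "perimeter": 70.0},
    {"fid": 102, "area": 400.0, "perimeter": 80.0},
]

# Non-spatial (descriptive) table, keyed by the same FID.
parcel_attributes = {
    101: {"owner": "Owner A", "land_use": "residential"},
    102: {"owner": "Owner B", "land_use": "commercial"},
}


def join_on_fid(fat, attributes):
    """Attach descriptive attributes to each feature through its FID."""
    joined = []
    for row in fat:
        extra = attributes.get(row["fid"], {})  # features without a match keep FAT columns only
        joined.append({**row, **extra})
    return joined


parcels = join_on_fid(parcel_fat, parcel_attributes)
```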
91Linking spatial and non-spatial tables
92Data Modeling
- Process of defining real-world phenomena or
geographic features of interest in terms of their
characteristics and their relationships with one
another
- it is concerned with different phases of work
carried out to implement information organization
and data structure
93Data Modeling (contd.)
- There are three steps in the data modeling
process, resulting in a series of progressively
formalized data models as the form of the
database becomes more and more rigorously defined
- conceptual data modeling: defining in broad and
generic terms the scope and requirements of a
database
- logical data modeling: specifying the user's
view of the database with a clear definition of
attributes and relationships
- physical data modeling: specifying internal
storage structure and file organization of the
database
94Conceptual Data Modeling
- Entity-relationship (E-R) modeling is probably
the most popular method of conceptual data
modeling
- It is sometimes referred to as a method of
semantic data modeling because it uses a human
language-like vocabulary to describe information
organization
95Conceptual Data Modeling (contd.)
- It involves four aspects of work
- identifying entities
- an entity is defined as a person, a place, an
event, a thing, etc.
- identifying attributes
- determining relationships
- drawing an entity-relationship diagram (E-R
diagram)
96Sample E-R Diagram
97Logical Data Modeling
- Comprehensive process by which the conceptual
data model is consolidated and refined
- the proposed database is reviewed in its entirety
in order to identify potential problems such as
- irrelevant data that will not be used
- omitted or missing data
- inappropriate representation of entities
- lack of integration between various parts of the
database
- unsupported applications
- potential additional cost to revise the database
98Logical Data Modeling (contd.)
- The end product of logical data modeling is a
logical schema
- the logical schema is developed by mapping the
conceptual data model (such as the E-R diagram)
to a software-dependent design document
99Logical Schema Example
100Physical Data Modeling
- Database design process by which the actual
tables that will be used to store the data are
defined in terms of
- data format: the format of the data that is
specific to a database management system (DBMS)
- storage requirements: the volume of the database
- physical location of data: optimizing system
performance by minimizing the need to transmit
data between different storage devices or data
servers
101Physical Data Modeling (contd.)
- The end product of physical data modeling is a
physical schema
- a physical schema is also variously known as data
dictionary, item definition table, data specific
table or physical database definition
- it is both software- and hardware-specific
- this means the physical schemas for different
systems look different from one another
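As a rough illustration of a physical schema, the sketch below defines two tables in SQLite through Python's standard `sqlite3` module and reads the column definitions back as a small data dictionary. SQLite is chosen only for convenience. The table and column names are hypothetical, and a schema for another DBMS would look different, as noted above.

```python
import sqlite3

# In-memory database so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE parcel (
    fid        INTEGER PRIMARY KEY,   -- feature identifier
    area_sq_m  REAL NOT NULL
);
CREATE TABLE parcel_owner (
    fid        INTEGER NOT NULL REFERENCES parcel(fid),
    owner_name TEXT NOT NULL
);
""")


def describe(connection, table):
    """Return (column name, declared type) pairs: a minimal data dictionary."""
    rows = connection.execute(f"PRAGMA table_info({table})")
    return [(name, col_type) for _cid, name, col_type, *_rest in rows]


parcel_columns = describe(conn, "parcel")
owner_columns = describe(conn, "parcel_owner")
```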
102Physical Schema Example
103End