Title: Ocean Biodiversity Informatics, Hamburg 29 Nov 2004
1. From Data to Uncertainty: Principles of Data Quality
Albatrosses, Kaikoura, New Zealand
Australian Biodiversity Information Services
2. The Data Equation
Praia de Forte, Brazil
3. The Data Equation
Doubtful Sound, New Zealand
4. The Data Equation
Wasatch, Utah, USA
5. The Data Equation
(Nix 1984)
6. Taking Data to Information
Decisions
Policy
Conservation
Management
Models
Decision Support
7. The Need for Modelling
Why do we need to use models?
- Oceans: population ?, collections ~10 million
- Plants: ??
- Vertebrates: ??
- Invertebrates: ??
8. What do we mean by Data Quality?
- "An essential or distinguishing characteristic necessary for spatial data to be fit for use." (SDTS 02/92)
- "The general intent of describing the quality of a particular dataset or record is to describe the fitness of that dataset or record for a particular use that one may have in mind for the data." (Chrisman 1991)
9. Data quality - fitness for use?
- Fitness for use
- Does species A occur in Tasmania?
- Does species A occur in National Park y?
10. The Biological Data Domains
Errors can occur in any one of these
- "Plant and animal specimen data held in museums provide a vast information resource, providing not only present day information on the locations of these entities, but also historic information going back several hundred years." (Chapman and Busby 1994)
11. Loss of data quality
- Loss of data quality can occur at many stages
- At the time of collection
- During digitisation
- During documentation
- During storage and archiving
- During analysis and manipulation
- At time of presentation
- And through the use to which they are put
"Don't underestimate the simple elegance of quality improvement. Other than teamwork, training, and discipline, it requires no special skills. Anyone who wants to can be an effective contributor." (Redman 2001)
12. Principles of data quality
- The Vision
- It is important for organizations to have a vision with respect to having good quality data.
- As well as a vision, an organization needs a policy to implement that vision.
- And a strategy for implementation.
"Experience has shown that treating data as a long-term asset and managing it within a coordinated framework produces considerable savings and ongoing value." (NLWRA 2003)
13. The data quality vision
- A Vision may involve
- Not reinventing information management wheels
- Looking for efficiencies in data collection and quality control procedures
- Sharing of data, information and tools
- Using existing standards or developing new, robust standards
- Fostering the development of networks and partnerships
- Presenting a sound business case for data collection and management
- Reducing duplication in data collection and data quality control
- Looking beyond immediate use and examining requirements of users
- Ensuring that good documentation and metadata procedures are in place.
14. Strategies
- Short term
  - Data that can be assembled and checked over a 6-12 month period
- Intermediate
  - Data that can be entered over about an 18-month period with a small investment of resources
  - Data that can be checked using simple in-house methods
- Long Term
  - Data that can be entered and/or checked over a longer time frame, using collaborative arrangements
15. Information management chain
From Chapman 2004
16. Data Cleaning Principles - 1
- Planning is essential
- Develop a vision, a policy and a strategy
- Total Data Quality Management Cycle
17. Data Cleaning Principles - 2
- Organising data improves efficiency
- Organising data prior to checking, validation and correction can improve efficiency and considerably reduce the time and cost of data cleaning.
- For example, by sorting data on location, efficiency gains can be achieved by checking all records pertaining to the one location at the same time, rather than going back and forth to key references.
- Similarly, by sorting records by collector and date, it is possible to spot errors where a record may be at an unlikely location for that collector on that day.
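The collector-and-date check above can be sketched in a few lines of Python. This is a minimal illustration, not the method used in any particular tool; the record fields (`collector`, `date`, `lat`, `lon`, `id`) and the 200 km threshold are hypothetical.

```python
# Sketch of a collector-itinerary check: two records by the same collector on
# the same day should not be hundreds of kilometres apart.
# Field names and the distance threshold are illustrative assumptions.
from math import radians, sin, cos, asin, sqrt
from itertools import groupby, combinations

def km(p, q):
    """Great-circle (haversine) distance between two (lat, lon) points, in km."""
    lat1, lon1, lat2, lon2 = map(radians, (*p, *q))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def suspect_records(records, max_km=200):
    """Flag pairs of records by one collector on one day that lie implausibly far apart."""
    key = lambda r: (r["collector"], r["date"])
    flagged = []
    for _, group in groupby(sorted(records, key=key), key=key):
        for a, b in combinations(list(group), 2):
            if km((a["lat"], a["lon"]), (b["lat"], b["lon"])) > max_km:
                flagged.append((a["id"], b["id"]))
    return flagged

recs = [
    {"id": 1, "collector": "Smith", "date": "1987-03-02", "lat": -42.4, "lon": 173.7},
    {"id": 2, "collector": "Smith", "date": "1987-03-02", "lat": -36.8, "lon": 174.8},  # ~630 km away
]
print(suspect_records(recs))  # [(1, 2)]
```

Sorting first (as the slide recommends) is what makes this cheap: each collector-day group is examined once.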
18. Data Cleaning Principles - 3
- Prevention is better than cure
- It is far cheaper and more efficient to prevent an error from happening than to have to detect and correct it later. It is also important that, when errors are detected, feedback mechanisms ensure the error doesn't occur again during data entry, or that there is a much lower likelihood of it re-occurring.
Asplenium bulbiferum, New Zealand
19. Data Cleaning Principles - 4
- Responsibility belongs to everyone (collector, custodian and user)
- The principal responsibility belongs to the data custodian.
- The collector has a responsibility to respond to the custodian's questions when the custodian finds errors or ambiguities that may refer back to the original information supplied by the collector. These may relate to ambiguities on the label, errors in the date or location, etc.
- The user also has a key responsibility to feed back to custodians information on any errors or omissions they may come across, including errors in the documentation associated with the data.
20. Data Cleaning Principles - 5
- Partnerships improve efficiency
- By developing partnerships, many data validation processes won't need to be duplicated, errors will more likely be documented and corrected, and new errors won't be introduced by inadvertent "correction" of suspect records that are not in error.
- Partnerships with
- Data collectors
- Other institutions with duplicate collections
- Like-minded institutions developing tools, standards and software
- Key data brokers (e.g. OBIS, GBIF)
- Data users (good feedback mechanisms)
- Statisticians and data auditors
21. Data Cleaning Principles - 6
- Prioritisation reduces duplication
- Prioritisation helps reduce costs and improves efficiency. It is often of value to concentrate on those records where the most data can be cleaned at the lowest cost.
- For example, those that can be examined using batch processing or automated methods, before working on the more difficult records.
- By concentrating on those data that are of most value to users, there is also a greater likelihood of errors being detected and corrected.
Tierra del Fuego, Argentina
22. Prioritising data quality procedures
- Focus on the most critical data first
- Concentrate on discrete units (taxonomic, geographic, etc.)
- Ignore data that are not used or for which data quality cannot be guaranteed
- Consider data that are of broadest value, are of greatest benefit to the majority of users and are of value to the most diverse of uses
- Work on those areas where lots of data can be cleaned at the lowest cost (e.g. through use of batch processing).
23. Data Cleaning Principles - 7
- Set targets and performance measures
- Performance measures are a valuable addition to quality control procedures.
- They help an organization manage its data cleaning processes.
- Performance measures may include statistical checks on the data (for example, 95% of all records are within 1,000 meters of their reported position) or on the level of quality control (for example, 65% of all records have been checked by a qualified taxonomist within the previous 5 years; 90% have been checked by a qualified taxonomist within the previous 10 years).
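A performance measure like "95% of records within 1,000 meters of their reported position" reduces to a one-line statistic once each record carries an uncertainty field. A minimal sketch, assuming a hypothetical `uncertainty_m` field holding the point-radius uncertainty in meters:

```python
# Sketch of a data-quality performance measure: the share of records whose
# reported coordinate uncertainty is within a threshold (in meters).
# The field name "uncertainty_m" is an illustrative assumption.
def pct_within(records, max_error_m=1000):
    """Percentage of records whose uncertainty radius is <= max_error_m."""
    ok = sum(1 for r in records if r["uncertainty_m"] <= max_error_m)
    return 100 * ok / len(records)

records = [{"uncertainty_m": u} for u in (50, 400, 900, 2500)]
print(pct_within(records))  # 75.0
```

Tracking this figure over time turns data cleaning from an open-ended chore into a measurable target.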
24. Data Cleaning Principles - 8
- Minimise duplication and re-working of data
- Duplication is a major factor in data cleaning in most organizations.
- Many organizations add the geocode at the same time as they database the record. As records are seldom sorted geographically, this means that the same locations will be chased up a number of times.
- By carrying out the georeferencing as a separate operation, records from similar locations can be sorted together, so the appropriate map-sheet only has to be extracted once.
- Some institutions also use the database itself to help reduce duplication, by searching to see if the location has already been georeferenced.
Nothofagus antarctica, Argentina
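The "search the database first" idea above is essentially a cache keyed on the locality string. A minimal sketch, where the resolver function standing in for a manual or gazetteer lookup is hypothetical:

```python
# Illustrative sketch of re-using earlier georeferencing work: keep a lookup
# of locality strings already resolved, so each location is chased up once.
gazetteer_cache = {}

def georeference(locality, resolve):
    """Return a cached (lat, lon) if this locality was resolved before;
    otherwise call the (expensive) resolver and remember the answer."""
    key = locality.strip().lower()   # normalise so trivial variants match
    if key not in gazetteer_cache:
        gazetteer_cache[key] = resolve(locality)
    return gazetteer_cache[key]

lookups = []
def resolver(name):                  # stand-in for a manual/gazetteer lookup
    lookups.append(name)
    return (-41.3, 174.8)

georeference("Wellington", resolver)
georeference("  wellington ", resolver)   # cache hit; resolver not called again
print(len(lookups))  # 1
```

In practice the cache would live in the database itself, and the normalisation step would need to handle spelling variants, but the cost saving is the same: one lookup per distinct locality.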
25. Data Cleaning Principles - 9
- Feedback is a two-way street
- Users of the data will inevitably carry out error detection, and it is important that they feed back the results to the custodians.
- It is essential that data custodians encourage feedback from users of their data, and take the feedback they receive seriously.
- Data custodians also need to feed back information on errors to the collectors and data suppliers where relevant.
- In this way there is a much higher likelihood that the incidence of future errors will be reduced and the overall data quality improved.
26. Data Cleaning Principles - 10
- Education and training improve techniques
- Poor training, especially at the data collection and data entry stages of the Information Quality Chain, is the cause of a large proportion of the errors in primary species data.
- Good training of data entry operators can considerably reduce the error associated with data entry, reduce data entry costs and improve overall data quality.
Brown Algae, Argentina
27. Data Cleaning Principles - 11
- Accountability, transparency and auditability are important
- Haphazard and unplanned data cleaning exercises are very inefficient and generally unproductive.
- Within data quality policies and strategies, clear lines of accountability for data cleaning need to be established.
- To improve the fitness for use of the data, and thus their quality, data cleaning processes need to be transparent and well documented, with a good audit trail to reduce duplication and to ensure that once corrected, errors never re-occur.
28. Data Cleaning Principles - 12
- Documentation is the key to good data quality
- Without good documentation, it is difficult for users to determine the fitness for use of the data, and difficult for custodians to know what data quality checks have been carried out, and by whom.
- Documentation is generally of two types.
- The first is tied to each record, and records what data checks have been done, what changes have been made, and by whom.
- The second is the metadata that records information at the dataset level.
- Both are important, and without them good data quality is compromised.
29. On-line Tools and Guidelines
30. Recording Accuracy and Error
- Additional Accuracy Fields
- Preferably in meters (Point-Radius)
- Documenting Validation tests
- Who
- What
- How
31. Methods for geocode validation
- Internal Database Checks
- External Database Checks
- Outliers in Geographic Space - GIS
- Outliers in Environmental Space - Models
- Statistical outliers
Butterfly, Florida, USA
32. Internal/External Database Checks
- Logical inconsistencies within the database
- Checking one field against another
- Text location vs geocode or District/State
- Checking one database against another
- Gazetteers
- DEM
- Collectors
Magellanic Penguin, Argentina
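A field-against-field check such as "text location vs geocode or District/State" can be sketched with nothing more than a lookup of administrative regions. In practice this would use polygon boundaries in a GIS; the bounding boxes and field names below are illustrative assumptions only:

```python
# Minimal internal-consistency sketch: does the record's stated state agree
# with its geocode? Bounding boxes are rough illustrations, not real borders.
STATE_BOUNDS = {  # state: (south, north, west, east) in decimal degrees
    "Tasmania":   (-43.7, -39.5, 143.8, 148.5),
    "Queensland": (-29.2, -10.0, 138.0, 153.6),
}

def state_mismatch(record):
    """Return True when the geocode falls outside the stated state's box."""
    s, n, w, e = STATE_BOUNDS[record["state"]]
    return not (s <= record["lat"] <= n and w <= record["lon"] <= e)

rec = {"state": "Tasmania", "lat": -27.5, "lon": 153.0}  # Brisbane coordinates
print(state_mismatch(rec))  # True
```

A real system would substitute state polygons (or a gazetteer and DEM, as the slide suggests) for the boxes, but the logic of cross-checking one field against another is the same.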
33. Error
- "Error is inescapable and it should be recognised as a fundamental dimension of data." (Chrisman 1991)
34. Geographic outliers - GIS
- Country, State, named district, etc.
Gazetteer of Brazilian localities
35. Geographic Outliers - GIS
- Collector's location vs date
36. Diva-GIS - Outlier
www.diva-gis.org
37. CRIA - Data Cleaning
http://splink.cria.org.br/dc/
38. Principal Components Analysis - FloraMap
Image from FloraMap (Jones and Gladkov 2001)
showing use of Principal Components Analysis to
identify an outlier in Rauvolfia littoralis
specimen data. A. Principal Components Analysis
B. Specimen record. C. Mapped specimen. D.
Climate profile
39. Cumulative Frequency Curves - Diva-GIS
Results from Diva-GIS showing the use of the Cumulative Frequency curve from BIOCLIM to identify possible geocoding errors in Rauvolfia littoralis. A1 and A2 show possible outliers in climate space; B1 and B2 show the corresponding mapped records. The blue lines represent the 97.5 percentile.
40. Environmental Outliers
- Cumulative Frequency Curves
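The cumulative-frequency idea behind the BIOCLIM-style check can be sketched in a few lines: records whose climate value falls outside the central percentile band of all records for the species are flagged for inspection. This is only a sketch of the principle; BIOCLIM itself derives climate profiles from climate surfaces, and the temperature values below are invented:

```python
# Sketch of a percentile-band environmental-outlier check: flag records whose
# climate value falls outside the 2.5-97.5 percentile band for the species.
def percentile(values, p):
    """Linear-interpolation percentile (p in 0-100) of a list of numbers."""
    v = sorted(values)
    k = (len(v) - 1) * p / 100
    lo, hi = int(k), min(int(k) + 1, len(v) - 1)
    return v[lo] + (v[hi] - v[lo]) * (k - lo)

def climate_outliers(values, lo_p=2.5, hi_p=97.5):
    """Values lying strictly outside the [lo_p, hi_p] percentile band."""
    lo, hi = percentile(values, lo_p), percentile(values, hi_p)
    return [v for v in values if v < lo or v > hi]

# Hypothetical mean annual temperatures (degrees C) at specimen locations:
temps = [14.1, 14.1, 14.8, 15.0, 15.2, 15.3, 15.6, 15.9, 16.2, 28.4]
print(climate_outliers(temps))  # [28.4]
```

A flagged value is not automatically an error; as the earlier slides stress, it is a candidate for checking against the label and the collector's itinerary.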
41. Errors in data
- "Although most data gathering disciplines treat error as an embarrassing issue to be expunged, the error inherent in (spatial) data deserves closer attention and public understanding." (Chrisman 1991)
42. Errors in data - 2
- "In general, error must not be treated as a potentially embarrassing inconvenience, because error provides a critical component in judging fitness for use." (Chrisman 1991)
Mizodendrum sp., Argentina
43. Future Challenges
- Improved data quality
- Improved documentation of data
- Improved access to distributed data
- Improved methods for modelling in aquatic (including marine) environments
- Decision Support Systems
- Enlightened Policy / Decision Makers!!!
44. Thank You
45. Acknowledgements
- Brazilian Biota/FAPESP Virtual Biodiversity Institute Program
- Reference Centre for Environmental Information, Brazil (CRIA)
- Global Biodiversity Information Facility (GBIF)
- UNESCO
- Wesleyan University, Connecticut, USA
- Peabody Museum, Yale University, USA
- ETI, Holland
- UN Food and Agriculture Organization (FAO)
- Environmental Resources Information Network, Australia (ERIN)
- Commission on Data for Science and Technology (CODATA)