Title: Ocean Biodiversity Informatics, Hamburg 29 Nov 2004
1. From Data to Uncertainty: Principles of Data Quality
Albatrosses, Kaikoura, New Zealand
Australian Biodiversity Information Services
2. The Data Equation
Praia de Forte, Brazil
3. The Data Equation
Doubtful Sound, New Zealand
4. The Data Equation
Wasatch, Utah, USA
5. The Data Equation
(Nix 1984)
6. Taking Data to Information
Decisions
Policy
Conservation
Management
Models
Decision Support
7. The Need for Modelling
Why do we need to use models?
- Oceans: population ?, collections ~10 million
- Plants: ??
- Vertebrates: ??
- Invertebrates: ??
8. What do we mean by Data Quality?
- "An essential or distinguishing characteristic necessary for spatial data to be fit for use." (SDTS 02/92)
- "The general intent of describing the quality of a particular dataset or record is to describe the fitness of that dataset or record for a particular use that one may have in mind for the data." (Chrisman 1991)
9. Data quality - fitness for use?
- Fitness for use
- Does species A occur in Tasmania?
- Does species A occur in National Park y?
10. The Biological Data Domains
Errors can occur in any one of these
- "Plant and animal specimen data held in museums provide a vast information resource, providing not only present day information on the locations of these entities, but also historic information going back several hundred years." (Chapman and Busby 1994)
11. Loss of data quality
- Loss of data quality can occur at many stages
- At the time of collection
- During digitisation
- During documentation
- During storage and archiving
- During analysis and manipulation
- At time of presentation
- And through the use to which they are put
"Don't underestimate the simple elegance of quality improvement. Other than teamwork, training, and discipline, it requires no special skills. Anyone who wants to can be an effective contributor." (Redman 2001)
12. Principles of data quality
- The Vision
- It is important for organizations to have a vision with respect to having good quality data.
- As well as a vision, an organization needs a policy to implement that vision.
- And a strategy for implementation.
"Experience has shown that treating data as a long-term asset and managing it within a coordinated framework produces considerable savings and ongoing value." (NLWRA 2003)
13. The data quality vision
- A Vision may involve
- Not reinventing information management wheels
- Looking for efficiencies in data collection and quality control procedures
- Sharing of data, information and tools
- Using existing standards or developing new, robust standards
- Fostering the development of networks and partnerships
- Presenting a sound business case for data collection and management
- Reducing duplication in data collection and data quality control
- Looking beyond immediate use and examining requirements of users
- Ensuring that good documentation and metadata procedures are in place.
14. Strategies
- Short term
  - Data that can be assembled and checked over a 6-12 month period
- Intermediate
  - Data that can be entered over about an 18-month period with a small investment of resources
  - Data that can be checked using simple in-house methods
- Long Term
  - Data that can be entered and/or checked over a longer time frame, using collaborative arrangements
15. Information management chain
From Chapman 2004
16. Data Cleaning Principles - 1
- Planning is essential
- Develop a vision, a policy and a strategy
- Total Data Quality Management Cycle
17. Data Cleaning Principles - 2
- Organising data improves efficiency
- Organising data prior to checking, validation and correction can improve efficiency and considerably reduce the time and cost of data cleaning.
- For example, by sorting data on location, efficiency gains can be achieved by checking all records pertaining to the one location at the same time, rather than going back and forth to key references.
- Similarly, by sorting records by collector and date, it is possible to spot errors where a record may be at an unlikely location for that collector on that day.
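The collector-and-date check above can be sketched in a few lines of Python. This is a minimal illustration, not the method used in any particular tool; the record fields (`collector`, `date`, `lat`, `lon`, `id`) and the 200 km threshold are hypothetical.

```python
# Sketch of a collector-itinerary check: two records by the same collector on
# the same day should not be hundreds of kilometres apart.
# Field names and the distance threshold are illustrative assumptions.
from math import radians, sin, cos, asin, sqrt
from itertools import groupby, combinations

def km(p, q):
    """Great-circle (haversine) distance between two (lat, lon) points, in km."""
    lat1, lon1, lat2, lon2 = map(radians, (*p, *q))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def suspect_records(records, max_km=200):
    """Flag pairs of records by one collector on one day that lie implausibly far apart."""
    key = lambda r: (r["collector"], r["date"])
    flagged = []
    for _, group in groupby(sorted(records, key=key), key=key):
        for a, b in combinations(list(group), 2):
            if km((a["lat"], a["lon"]), (b["lat"], b["lon"])) > max_km:
                flagged.append((a["id"], b["id"]))
    return flagged

recs = [
    {"id": 1, "collector": "Smith", "date": "1987-03-02", "lat": -42.4, "lon": 173.7},
    {"id": 2, "collector": "Smith", "date": "1987-03-02", "lat": -36.8, "lon": 174.8},  # ~630 km away
]
print(suspect_records(recs))  # [(1, 2)]
```

Sorting first (as the slide recommends) is what makes this cheap: each collector-day group is examined once.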
18. Data Cleaning Principles - 3
- Prevention is better than cure
- It is far cheaper and more efficient to prevent an error from happening than to have to detect and correct it later. It is also important that, when errors are detected, feedback mechanisms ensure the error doesn't occur again during data entry, or that there is a much lower likelihood of it re-occurring.
Asplenium bulbiferum, New Zealand
19. Data Cleaning Principles - 4
- Responsibility belongs to everyone (collector, custodian and user)
- The principal responsibility belongs to the data custodian.
- The collector has a responsibility to respond to the custodian's questions when the custodian finds errors or ambiguities that may refer back to the original information supplied by the collector. These may relate to ambiguities on the label, errors in the date or location, etc.
- The user also has a key responsibility to feed back to custodians information on any errors or omissions they may come across, including errors in the documentation associated with the data.
20. Data Cleaning Principles - 5
- Partnerships improve efficiency
- By developing partnerships, many data validation processes won't need to be duplicated, errors will more likely be documented and corrected, and new errors won't be introduced by inadvertent "correction" of suspect records that are not in error.
- Partnerships with
- Data collectors
- Other institutions with duplicate collections
- Like-minded institutions developing tools, standards and software
- Key data brokers (e.g. OBIS, GBIF)
- Data users (good feedback mechanisms)
- Statisticians and data auditors
21. Data Cleaning Principles - 6
- Prioritisation reduces duplication
- Prioritisation helps reduce costs and improves efficiency. It is often of value to concentrate on those records where the most data can be cleaned at the lowest cost.
- For example, those that can be examined using batch processing or automated methods, before working on the more difficult records.
- By concentrating on those data that are of most value to users, there is also a greater likelihood of errors being detected and corrected.
Tierra del Fuego, Argentina
22. Prioritising data quality procedures
- Focus on the most critical data first
- Concentrate on discrete units (taxonomic, geographic, etc.)
- Ignore data that are not used or for which data quality cannot be guaranteed
- Consider data that are of broadest value, are of greatest benefit to the majority of users and are of value to the most diverse of uses
- Work on those areas where lots of data can be cleaned at the lowest cost (e.g. through use of batch processing).
23. Data Cleaning Principles - 7
- Set targets and performance measures
- Performance measures are a valuable addition to quality control procedures.
- They help an organization manage its data cleaning processes.
- Performance measures may include statistical checks on the data (for example, 95% of all records are within 1,000 meters of their reported position) or on the level of quality control (for example, 65% of all records have been checked by a qualified taxonomist within the previous 5 years; 90% have been checked by a qualified taxonomist within the previous 10 years).
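A performance measure like "95% of records within 1,000 meters of their reported position" reduces to a one-line statistic once each record carries an uncertainty field. A minimal sketch, assuming a hypothetical `uncertainty_m` field holding the point-radius uncertainty in meters:

```python
# Sketch of a data-quality performance measure: the share of records whose
# reported coordinate uncertainty is within a threshold (in meters).
# The field name "uncertainty_m" is an illustrative assumption.
def pct_within(records, max_error_m=1000):
    """Percentage of records whose uncertainty radius is <= max_error_m."""
    ok = sum(1 for r in records if r["uncertainty_m"] <= max_error_m)
    return 100 * ok / len(records)

records = [{"uncertainty_m": u} for u in (50, 400, 900, 2500)]
print(pct_within(records))  # 75.0
```

Tracking this figure over time turns data cleaning from an open-ended chore into a measurable target.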
24. Data Cleaning Principles - 8
- Minimise duplication and re-working of data
- Duplication is a major factor in data cleaning in most organizations.
- Many organizations add the geocode at the same time as they database the record. As records are seldom sorted geographically, this means that the same locations will be chased up a number of times.
- By carrying out the georeferencing as a separate operation, records from similar locations can be sorted together, so the appropriate map-sheet only has to be extracted once.
- Some institutions also use the database itself to help reduce duplication, by searching to see if the location has already been georeferenced.
Nothofagus antarctica, Argentina
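The "search the database first" idea above is essentially a cache keyed on the locality string. A minimal sketch, where the resolver function standing in for a manual or gazetteer lookup is hypothetical:

```python
# Illustrative sketch of re-using earlier georeferencing work: keep a lookup
# of locality strings already resolved, so each location is chased up once.
gazetteer_cache = {}

def georeference(locality, resolve):
    """Return a cached (lat, lon) if this locality was resolved before;
    otherwise call the (expensive) resolver and remember the answer."""
    key = locality.strip().lower()   # normalise so trivial variants match
    if key not in gazetteer_cache:
        gazetteer_cache[key] = resolve(locality)
    return gazetteer_cache[key]

lookups = []
def resolver(name):                  # stand-in for a manual/gazetteer lookup
    lookups.append(name)
    return (-41.3, 174.8)

georeference("Wellington", resolver)
georeference("  wellington ", resolver)   # cache hit; resolver not called again
print(len(lookups))  # 1
```

In practice the cache would live in the database itself, and the normalisation step would need to handle spelling variants, but the cost saving is the same: one lookup per distinct locality.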
25. Data Cleaning Principles - 9
- Feedback is a two-way street
- Users of the data will inevitably carry out error detection, and it is important that they feed back the results to the custodians.
- It is essential that data custodians encourage feedback from users of their data, and take the feedback they receive seriously.
- Data custodians also need to feed back information on errors to the collectors and data suppliers where relevant.
- In this way there is a much higher likelihood that the incidence of future errors will be reduced and the overall data quality improved.
26. Data Cleaning Principles - 10
- Education and training improve techniques
- Poor training, especially at the data collection and data entry stages of the Information Quality Chain, is the cause of a large proportion of the errors in primary species data.
- Good training of data entry operators can considerably reduce the error associated with data entry, reduce data entry costs and improve overall data quality.
Brown Algae, Argentina
27. Data Cleaning Principles - 11
- Accountability, transparency and auditability are important
- Haphazard and unplanned data cleaning exercises are very inefficient and generally unproductive.
- Within data quality policies and strategies, clear lines of accountability for data cleaning need to be established.
- To improve the fitness for use of the data, and thus their quality, data cleaning processes need to be transparent and well documented, with a good audit trail to reduce duplication and to ensure that once corrected, errors never re-occur.
28. Data Cleaning Principles - 12
- Documentation is the key to good data quality
- Without good documentation, it is difficult for users to determine the fitness for use of the data, and difficult for custodians to know what data quality checks have been carried out, and by whom.
- Documentation is generally of two types.
- The first is tied to each record, and records what data checks have been done, what changes have been made, and by whom.
- The second is the metadata that records information at the dataset level.
- Both are important, and without them good data quality is compromised.
29. On-line Tools and Guidelines
30. Recording Accuracy and Error
- Additional Accuracy Fields
- Preferably in meters (Point-Radius)
- Documenting Validation tests
- Who
- What
- How
31. Methods for geocode validation
- Internal Database Checks
- External Database Checks
- Outliers in Geographic Space - GIS
- Outliers in Environmental Space - Models
- Statistical outliers
Butterfly, Florida, USA
32. Internal/External Database Checks
- Logical inconsistencies within the database
- Checking one field against another
- Text location vs geocode or District/State
- Checking one database against another
- Gazetteers
- DEM
- Collectors
Magellanic Penguin, Argentina
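A field-against-field check such as "text location vs geocode or District/State" can be sketched with nothing more than a lookup of administrative regions. In practice this would use polygon boundaries in a GIS; the bounding boxes and field names below are illustrative assumptions only:

```python
# Minimal internal-consistency sketch: does the record's stated state agree
# with its geocode? Bounding boxes are rough illustrations, not real borders.
STATE_BOUNDS = {  # state: (south, north, west, east) in decimal degrees
    "Tasmania":   (-43.7, -39.5, 143.8, 148.5),
    "Queensland": (-29.2, -10.0, 138.0, 153.6),
}

def state_mismatch(record):
    """Return True when the geocode falls outside the stated state's box."""
    s, n, w, e = STATE_BOUNDS[record["state"]]
    return not (s <= record["lat"] <= n and w <= record["lon"] <= e)

rec = {"state": "Tasmania", "lat": -27.5, "lon": 153.0}  # Brisbane coordinates
print(state_mismatch(rec))  # True
```

A real system would substitute state polygons (or a gazetteer and DEM, as the slide suggests) for the boxes, but the logic of cross-checking one field against another is the same.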
33. Error
- "Error is inescapable and it should be recognised as a fundamental dimension of data." (Chrisman 1991)
34. Geographic outliers - GIS
- Country, State, named district, etc.
Gazetteer of Brazilian localities
35. Geographic Outliers - GIS
- Collector's location vs date
36. Diva-GIS - Outlier
www.diva-gis.org
37. CRIA - Data Cleaning
http://splink.cria.org.br/dc/
38. Principal Components Analysis - FloraMap
Image from FloraMap (Jones and Gladkov 2001)
showing use of Principal Components Analysis to
identify an outlier in Rauvolfia littoralis
specimen data. A. Principal Components Analysis
B. Specimen record. C. Mapped specimen. D.
Climate profile
39. Cumulative Frequency Curves - Diva-GIS
Results from Diva-GIS showing the use of the Cumulative Frequency curve from BIOCLIM to identify possible geocoding errors in Rauvolfia littoralis. A1 and A2 show possible outliers in climate space; B1 and B2 show the corresponding mapped records. The blue lines represent the 97.5 percentile.
40. Environmental Outliers
- Cumulative Frequency Curves
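The cumulative-frequency idea behind the BIOCLIM-style check can be sketched in a few lines: records whose climate value falls outside the central percentile band of all records for the species are flagged for inspection. This is only a sketch of the principle; BIOCLIM itself derives climate profiles from climate surfaces, and the temperature values below are invented:

```python
# Sketch of a percentile-band environmental-outlier check: flag records whose
# climate value falls outside the 2.5-97.5 percentile band for the species.
def percentile(values, p):
    """Linear-interpolation percentile (p in 0-100) of a list of numbers."""
    v = sorted(values)
    k = (len(v) - 1) * p / 100
    lo, hi = int(k), min(int(k) + 1, len(v) - 1)
    return v[lo] + (v[hi] - v[lo]) * (k - lo)

def climate_outliers(values, lo_p=2.5, hi_p=97.5):
    """Values lying strictly outside the [lo_p, hi_p] percentile band."""
    lo, hi = percentile(values, lo_p), percentile(values, hi_p)
    return [v for v in values if v < lo or v > hi]

# Hypothetical mean annual temperatures (degrees C) at specimen locations:
temps = [14.1, 14.1, 14.8, 15.0, 15.2, 15.3, 15.6, 15.9, 16.2, 28.4]
print(climate_outliers(temps))  # [28.4]
```

A flagged value is not automatically an error; as the earlier slides stress, it is a candidate for checking against the label and the collector's itinerary.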
41. Errors in data
- "Although most data gathering disciplines treat error as an embarrassing issue to be expunged, the error inherent in (spatial) data deserves closer attention and public understanding." (Chrisman 1991)
42. Errors in data - 2
- "In general, error must not be treated as a potentially embarrassing inconvenience, because error provides a critical component in judging fitness for use." (Chrisman 1991)
Mizodendrum sp., Argentina
43. Future Challenges
- Improved data quality
- Improved documentation of data
- Improved access to distributed data
- Improved methods for modelling in aquatic (including marine) environments
- Decision Support Systems
- Enlightened Policy / Decision Makers!!!
44. Thank You
45. Acknowledgements
- Brazilian Biota/FAPESP Virtual Biodiversity Institute Program
- Reference Centre for Environmental Information, Brazil (CRIA)
- Global Biodiversity Information Facility (GBIF)
- UNESCO
- Wesleyan University, Connecticut, USA
- Peabody Museum, Yale University, USA
- ETI, Holland
- UN Food and Agriculture Organization (FAO)
- Environmental Resources Information Network, Australia (ERIN)
- Commission on Data for Science and Technology (CODATA)