Ocean Biodiversity Informatics, Hamburg, 29 Nov 2004
1
From Data to Uncertainty: Principles of Data
Quality
Albatrosses, Kaikoura, New Zealand
  • Arthur D. Chapman

Australian Biodiversity Information Services
2
The Data Equation
  • Oceans of Data

Praia de Forte, Brazil
3
The Data Equation
  • Rivers of Information

Doubtful Sound, New Zealand
4
The Data Equation
  • Streams of
  • Knowledge

Wasatch, Utah, USA
5
The Data Equation
  • Drops of
  • Understanding

(Nix 1984)
6
Taking Data to Information
Decisions
Policy
Conservation
Management
Models
Decision Support
7
The Need for Modelling
Why do we need to use models?
  • Oceans Population 0
  • Collections ?10 million
  • Plants ??
  • Vertebrates ??
  • Invertebrates ??
8
What do we mean by Data Quality?
  • An essential or distinguishing characteristic
    necessary for spatial data to be fit for use.
  • SDTS 02/92
  • The general intent of describing the quality
    of a particular dataset or record is to describe
    the fitness of that dataset or record for a
    particular use that one may have in mind for the
    data.
  • Chrisman, 1991

9
Data quality - fitness for use?
  • Fitness for use
  • Does species A occur in Tasmania?
  • Does species A occur in National Park Y?

10
The Biological Data Domains
Errors can occur in any one of these
  • Plant and animal specimen data held in
    museums provide a vast information resource,
    providing not only present day information on the
    locations of these entities, but also historic
    information going back several hundred years
  • (Chapman and Busby 1994).

11
Loss of data quality
  • Loss of data quality can occur at many stages
  • At the time of collection
  • During digitisation
  • During documentation
  • During storage and archiving
  • During analysis and manipulation
  • At time of presentation
  • And through the use to which they are put

Don't underestimate the simple elegance of
quality improvement. Other than teamwork,
training, and discipline, it requires no special
skills. Anyone who wants to can be an effective
contributor. (Redman 2001).
12
Principles of data quality
  • The Vision
  • It is important for organizations to have a
    vision with respect to having good quality data.
  • As well as a vision, an organization needs a
    policy to implement that vision.
  • And a strategy for implementation

Experience has shown that treating data as a
long-term asset and managing it within a
coordinated framework produces considerable
savings and ongoing value. (NLWRA 2003).
13
The data quality vision
  • A Vision may involve
  • Not reinventing information management wheels
  • Looking for efficiencies in data collection and
    quality control procedures
  • Sharing of data, information and tools
  • Using existing standards or developing new,
    robust standards
  • Fostering the development of networks and
    partnerships
  • Presenting a sound business case for data
    collection and management
  • Reducing duplication in data collection and data
    quality control
  • Looking beyond immediate use and examining
    requirements of users
  • Ensuring that good documentation and metadata
    procedures are in place.

14
Strategies
  • Short term
  • - Data that can be assembled and checked over a
    6-12 month period
  • Intermediate
  • - Data that can be entered over about an 18-month
    period with small investment of resources
  • - Data that can be checked using simple in-house
    methods
  • Long Term
  • - Data that can be entered and/or checked over a
    longer time frame, using collaborative
    arrangements

15
Information management chain
From Chapman 2004
16
Data Cleaning Principles -1
  • Planning is essential
  • develop a vision, a policy and strategy
  • Total Data Quality Management Cycle

17
Data Cleaning Principles - 2
  • Organising Data improves efficiency
  • The organizing of data prior to data checking,
    validation and correction can improve efficiency
    and considerably reduce the time and costs of
    data cleaning.
  • For example, by sorting data on location,
    efficiency gains can be achieved through checking
    all records pertaining to the one location at the
    same time, rather than going back and forth to
    key references.
  • Similarly, by sorting records by collector and
    date, it is possible to spot errors where a
    record may be at an unlikely location for that
    collector on that day (see the sketch below).

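A minimal sketch in Python of the sorting step described above, using hypothetical field names (locality, collector, date); the grouping keys, not the field names, are the point.

# Minimal sketch, assuming simple dictionary records with hypothetical fields.
records = [
    {"id": 1, "locality": "Kaikoura", "collector": "Smith", "date": "1987-03-02"},
    {"id": 2, "locality": "Doubtful Sound", "collector": "Jones", "date": "1987-03-02"},
    {"id": 3, "locality": "Kaikoura", "collector": "Smith", "date": "1987-03-05"},
]

# Sort by locality, so every record for one place is checked in a single
# pass and the relevant map sheet or gazetteer entry is consulted only once.
by_locality = sorted(records, key=lambda r: r["locality"])

# Sort by collector and date, so records that place the same collector in
# two distant localities on the same day stand out as likely errors.
by_collector_day = sorted(records, key=lambda r: (r["collector"], r["date"]))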
18
Data Cleaning Principles - 3
  • Prevention is better than cure
  • It is far cheaper and more efficient to prevent
    an error from happening than to have to detect
    and correct it later. It is also important that,
    when errors are detected, feedback mechanisms
    ensure the error doesn't occur again during data
    entry, or that there is a much lower likelihood
    of it recurring.

Asplenium bulbiferum, New Zealand
19
Data Cleaning Principles - 4
  • Responsibility belongs to everyone
  • (collector, custodian and user)
  • The principal responsibility belongs to the data
    custodian
  • The collector has a responsibility to respond to
    the custodian's questions when the custodian
    finds errors or ambiguities that may refer back
    to the original information supplied by the
    collector. These may relate to ambiguities on the
    label, errors in the date or location, etc.
  • The user also has a key responsibility to feed
    back to custodians information on any errors or
    omissions they may come across, including errors
    in the documentation associated with the data.

20
Data Cleaning Principles - 5
  • Partnerships improve efficiency
  • By developing partnerships, many data validation
    processes won't need to be duplicated, errors
    will more likely be documented and corrected, and
    new errors won't be incorporated by inadvertent
    correction of suspect records that are not in
    error.
  • Partnerships with
  • Data collectors
  • Other institutions with duplicate collections
  • Like-minded institutions developing tools,
    standards and software
  • Key data brokers (e.g. OBIS, GBIF)
  • Data users (good feedback mechanisms)
  • Statisticians and data auditors

21
Data Cleaning Principles - 6
  • Prioritisation reduces duplication
  • Prioritisation helps reduce costs and improves
    efficiency. It is often of value to concentrate
    on those records where lots of data can be
    cleaned at the lowest cost.
  • For example, those that can be examined using
    batch processing or automated methods, before
    working on the more difficult records.
  • By concentrating on those data that are of most
    value to users, there is also a greater
    likelihood of errors being detected and
    corrected.

Tierra del Fuego, Argentina
22
Prioritising data quality procedures
  • Focus on most critical data first
  • Concentrate on discrete units (taxonomic,
    geographic, etc.)
  • Ignore data that are not used or for which data
    quality cannot be guaranteed
  • Consider data that are of broadest value, are of
    greatest benefit to the majority of users and are
    of value to the most diverse of uses
  • Work on those areas where lots of data can be
    cleaned at the lowest cost (e.g. through use of
    batch processing).

23
Data Cleaning Principles -7
  • Set targets and performance measures
  • Performance measures are a valuable addition to
    quality control procedures.
  • They help an organization manage its data
    cleaning processes.
  • Performance measures may include statistical
    checks on the data (for example, 95% of all
    records are within 1,000 meters of their reported
    position) or on the level of quality control
    (for example, 65% of all records have been
    checked by a qualified taxonomist within the
    previous 5 years, and 90% within the previous
    10 years). See the sketch below.

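A minimal sketch in Python of computing such measures, assuming hypothetical per-record fields for positional accuracy and for the date of the last taxonomic check.

from datetime import date

def performance_measures(records, today=date(2004, 11, 29)):
    """Share of records meeting assumed accuracy and verification targets."""
    n = len(records)
    within_1km = sum(r["accuracy_m"] <= 1000 for r in records)
    checked_5y = sum((today - r["last_det_check"]).days <= 5 * 365
                     for r in records if r["last_det_check"])
    checked_10y = sum((today - r["last_det_check"]).days <= 10 * 365
                      for r in records if r["last_det_check"])
    return {"pct_within_1000m": 100.0 * within_1km / n,
            "pct_checked_5yr": 100.0 * checked_5y / n,
            "pct_checked_10yr": 100.0 * checked_10y / n}

records = [
    {"accuracy_m": 250,  "last_det_check": date(2001, 6, 1)},
    {"accuracy_m": 5000, "last_det_check": date(1996, 2, 10)},
]
print(performance_measures(records))
# {'pct_within_1000m': 50.0, 'pct_checked_5yr': 50.0, 'pct_checked_10yr': 100.0}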
24
Data Cleaning Principles - 8
  • Minimise duplication and re-working of data
  • Duplication is a major factor with data cleaning
    in most organizations.
  • Many organizations add the geocode at the same
    time as they database the record. As records are
    seldom sorted geographically, this means that the
    same locations will be chased up a number of
    times.
  • By carrying out the georeferencing as a special
    operation, records from similar locations can
    then be sorted and then the appropriate map-sheet
    only has to be extracted once.
  • Some institutions also use the database itself to
    help reduce duplication, by searching to see
    whether the location has already been
    georeferenced (see the sketch below).

Nothofagus antarctica, Argentina
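A minimal sketch in Python of that lookup, assuming a hypothetical table of previously georeferenced localities; unmatched localities are left for the sorted, batch georeferencing pass.

# Hypothetical cache of localities that have already been georeferenced.
georeferenced = {
    "Praia do Forte, Bahia": (-12.58, -38.00),
    "Kaikoura, New Zealand": (-42.40, 173.68),
}

def lookup_geocode(locality):
    """Reuse an existing geocode if the locality has been done before;
    otherwise return None so the record joins the next batch run."""
    return georeferenced.get(locality)

print(lookup_geocode("Kaikoura, New Zealand"))  # (-42.4, 173.68), reused
print(lookup_geocode("Wasatch Range, Utah"))    # None - queue for batch georeferencing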
25
Data Cleaning Principles - 9
  • Feedback is a two-way street
  • Users of the data will inevitably carry out error
    detection, and it is important that they feed
    back the results to the custodians.
  • It is essential that data custodians encourage
    feedback from users of their data, and take the
    feedback that they receive seriously.
  • Data custodians also need to feed back
    information on errors to the collectors and data
    suppliers where relevant.
  • In this way there is a much higher likelihood
    that the incidence of future errors will be
    reduced and the overall data quality improved.

26
Data Cleaning Principles - 10
  • Education and training improves techniques
  • Poor training, especially at the data collection
    and data entry stages of the Information Quality
    Chain, is the cause of a large proportion of the
    errors in primary species data.
  • Good training of data entry operators can reduce
    the error associated with data entry
    considerably, reduce data entry costs and improve
    overall data quality.

Brown Algae, Argentina
27
Data Cleaning Principles - 11
  • Accountability, Transparency and Auditability
    are important
  • Haphazard and unplanned data cleaning exercises
    are very inefficient and generally unproductive.
  • Within data quality policies and strategies
    clear lines of accountability for data cleaning
    need to be established.
  • To improve the fitness for use of the data and
    thus their quality, data cleaning processes need
    to be transparent and well documented with a good
    audit trail to reduce duplication and to ensure
    that once corrected, errors never re-occur.

28
Data Cleaning Principles - 12
  • Documentation is the key to good data quality
  • Without good documentation, it is difficult for
    users to determine the fitness for use of the
    data, and difficult for custodians to know which
    data quality checks have been carried out and by
    whom.
  • Documentation is generally of two types.
  • The first is tied to each record and records what
    data checks have been done and what changes have
    been made and by whom.
  • The second is the metadata that records
    information at the dataset level.
  • Both are important, and without them, good data
    quality is compromised.

29
On-line Tools and Guidelines
30
Recording Accuracy and Error
  • Additional Accuracy Fields
  • Preferably in meters (Point-Radius)
  • Documenting Validation tests (see the sketch
    below)
  • Who
  • What
  • How

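A minimal sketch of how such fields might be stored, using hypothetical field names for the point-radius accuracy value (in meters) and for who ran which validation test and how.

# Hypothetical record layout; the field names are illustrative, not a standard.
record = {
    "decimal_latitude": -42.40,
    "decimal_longitude": 173.68,
    "coordinate_uncertainty_m": 1000,   # point-radius: accuracy in meters
    "validation_tests": [
        {
            "who": "A. Curator",
            "what": "geocode checked against gazetteer",
            "how": "batch comparison with a national place-name gazetteer",
            "date": "2004-10-12",
        },
    ],
}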
31
Methods for geocode validation
  • Internal Database Checks
  • External Database Checks
  • Outliers in Geographic Space - GIS
  • Outliers in Environmental Space - Models
  • Statistical outliers

Butterfly, Florida, USA
32
Internal/External Database Checks
  • Logical inconsistencies within the database
  • Checking one field against another
  • Text location vs geocode or District/State (see
    the sketch below)
  • Checking one database against another
  • Gazetteers
  • DEM
  • Collectors

Magellanic Penguin, Argentina
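A minimal sketch in Python of an internal consistency check of that kind, with a toy lookup standing in for a point-in-polygon query against a State boundary layer; the field names and coordinates are illustrative.

def state_from_geocode(lat, lon):
    # Toy stand-in for a point-in-polygon query against a State boundary
    # layer in a GIS; a rough bounding box for Tasmania is used here.
    if -44.0 <= lat <= -39.5 and 143.5 <= lon <= 149.0:
        return "Tasmania"
    return "other"

record = {"state": "Tasmania", "lat": -35.3, "lon": 149.1}  # label vs geocode

if state_from_geocode(record["lat"], record["lon"]) != record["state"]:
    print("Flag for checking: geocode falls outside the recorded State")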
33
Error
  • Error is inescapable and it should be
    recognised as a fundamental dimension of data.
  • Chrisman 1991

34
Geographic outliers - GIS
  • Country, State, named district, etc.

Gazetteer of Brazilian localities
35
Geographic Outliers - GIS
  • Collector's location vs date (see the sketch
    below)

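A minimal sketch in Python of the itinerary check: two records by the same collector on the same day that are implausibly far apart are flagged. The records, the 500 km threshold and the field names are illustrative.

import math

def km_between(a, b):
    """Great-circle (haversine) distance between two (lat, lon) points in km."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

records = [
    {"collector": "Smith", "date": "1987-03-02", "pos": (-42.40, 173.68)},
    {"collector": "Smith", "date": "1987-03-02", "pos": (-36.85, 174.76)},
]

for i, r1 in enumerate(records):
    for r2 in records[i + 1:]:
        if (r1["collector"], r1["date"]) == (r2["collector"], r2["date"]):
            d = km_between(r1["pos"], r2["pos"])
            if d > 500:  # assumed plausibility threshold for one day's travel
                print(f"Suspect: {r1['collector']} on {r1['date']}, "
                      f"localities {d:.0f} km apart")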
36
Diva-GIS - Outlier
www.diva-gis.org
37
CRIA-Data Cleaning
http://splink.cria.org.br/dc/
38
Principal Components Analysis - FloraMap
Image from FloraMap (Jones and Gladkov 2001)
showing use of Principal Components Analysis to
identify an outlier in Rauvolfia littoralis
specimen data. A. Principal Components Analysis
B. Specimen record. C. Mapped specimen. D.
Climate profile
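A minimal sketch in Python (NumPy) of the idea behind such a check: climate values at each record are standardised and projected onto principal components, and the record sitting far from the rest in component space stands out. The values are illustrative, not the FloraMap data.

import numpy as np

# Rows are specimen records, columns are climate variables (e.g. annual mean
# temperature, annual precipitation) read off at each record's geocode.
climate = np.array([
    [24.1, 1800.0], [23.8, 1750.0], [24.4, 1820.0],
    [23.9, 1790.0], [24.2, 1810.0], [17.5,  600.0],   # last row: suspect record
])

# Standardise, then take principal axes from the covariance matrix.
z = (climate - climate.mean(axis=0)) / climate.std(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(z, rowvar=False))
scores = z @ eigvecs[:, ::-1]        # component scores, largest component first

# The record farthest from the centroid along the first component is the
# candidate outlier to re-check against its label and geocode.
print(int(np.argmax(np.abs(scores[:, 0]))))   # 5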
39
Cumulative Frequency Curves - Diva-GIS
Results from Diva-GIS showing the use of the
Cumulative Frequency curve from BIOCLIM to
identify possible geocoding errors in Rauvolfia
littoralis. A1 and A2 show possible outliers in
climate space, B1 and B2 the corresponding mapped
records. The blue lines represent the 97.5
percentile.
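A minimal sketch in Python of the percentile idea behind the cumulative frequency check: a record whose climate value falls outside the 2.5 to 97.5 percentile range of the species' records is flagged for re-checking. The values and the simple interpolation are illustrative, not the BIOCLIM algorithm itself.

def percentile(sorted_values, p):
    """Percentile by linear interpolation on already-sorted data."""
    k = (len(sorted_values) - 1) * p / 100.0
    lo, hi = int(k), min(int(k) + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (k - lo)

# Annual mean temperature (degrees C) at each record's geocode - illustrative.
annual_mean_temp = sorted([22.1, 23.4, 23.8, 24.0, 24.2, 24.5, 24.9, 25.1, 31.6])

low, high = percentile(annual_mean_temp, 2.5), percentile(annual_mean_temp, 97.5)
suspects = [t for t in annual_mean_temp if t < low or t > high]
print(suspects)   # [22.1, 31.6] - both tails flagged; the 31.6 record is extreme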
40
Environmental Outliers
  • Cumulative Frequency Curves

41
Errors in data
  • Although most data gathering disciplines treat
    error as an embarrassing issue to be expunged,
    the error inherent in (spatial) data deserves
    closer attention and public understanding.
  • Chrisman, 1991

42
Errors in data - 2
  • In general, error must not be treated as a
    potentially embarrassing inconvenience, because
    error provides a critical component in judging
    fitness for use.
  • Chrisman, 1991

Mizodendrum sp., Argentina
43
Future Challenges
  • Improved data quality
  • Improved documentation of data
  • Improved access to distributed data
  • Improved methods for modelling in aquatic
    (including marine) environments
  • Decision Support Systems
  • Enlightened Policy / Decision Makers!!!

44
Thank You
  • Questions?

45
Acknowledgements
  • Brazilian Biota/FAPESP Virtual Biodiversity
    Institute Program
  • Reference Centre for Environmental Information,
    Brazil (CRIA)
  • Global Biodiversity Information Facility (GBIF)
  • UNESCO
  • Wesleyan University, Connecticut, USA
  • Peabody Museum, Yale University, USA
  • ETI, Holland
  • UN Food and Agriculture Organization (FAO)
  • Environmental Resources Information Network,
    Australia (ERIN)
  • Commission on Data for Science and Technology
    (CODATA)