Combining Central Crop Databases for EURISCO - PowerPoint PPT Presentation

1
Combining Central Crop Databases for EURISCO
  • Frank Menting

2
Content
  • background
  • gathering the data
  • conversion to EURISCO format
  • analysis of consolidated data
  • Documentation Quality Index (DQI)
  • comparison of CCDBs versus National Inventories
    (NIs)
  • conclusions

3
Background
  • part of the EPGRIS project
  • combination of all available ECP/GR CCDBs
  • create test-data-set for EURISCO
  • create data-sets per country for use by focal
    persons
  • comparison / completion National Inventories

4
Background

5
Gathering the data
  • 45 CCDBs listed on ECP/GR site (July 2002)
  • 20 downloaded from Internet site
  • 15 received following Email correspondence
  • 10 not received (under development or no reply)
  • 35 data sets were combined
  • included the largest crops (small grains)

6
Conversion to EURISCO format
  • data were converted into EURISCO format
  • EURISCO format = MCPD + 5 additional fields
  • NICODE (identification of national inventory)
  • BREDDESCR (description of breeder)
  • DONORDESCR (description of donor)
  • DUPLDESCR (description of duplication site)
  • ACCEURL (hot link to detailed accession
    information)

7
Conversion to EURISCO format
  • steps conversion
  • importing dbf, mdb or txt formatted files into
    Excel
  • match columns to EURISCO descriptors
  • check format of all columns
  • VBA macros
  • transform format when required
  • as far as possible VBA macros

8
Conversion to EURISCO format
  • main conversion issues
  • institution codes had to be FAO Institute Codes
  • automatic transformation if ECP acronym was
    used, otherwise transfer to corresponding DESCR
    column
  • in case of INSTCODE (mandatory) more effort to
    complete
  • contributors section of the CCDB sites
  • search the FAO-list manually
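
The rule for institution fields on this slide can be sketched in a few lines. This is only an illustration, not the VBA macro used in the project: the lookup table is a hypothetical placeholder (the acronyms and codes are invented), and the function name and the DONORCODE / DONORDESCR column pair are assumptions chosen for the example.

```python
# Hypothetical lookup table: ECP/GR acronym -> FAO Institute Code.
# The entries are placeholders for illustration, not real codes.
ACRONYM_TO_FAO = {"EXAMPLE-A": "AAA001", "EXAMPLE-B": "BBB002"}


def convert_institute(value, code_field, descr_field):
    """Fill the code column if the acronym is known, else the DESCR column."""
    text = (value or "").strip()
    if not text:
        # Nothing to convert: both columns stay empty.
        return {code_field: "", descr_field: ""}
    fao_code = ACRONYM_TO_FAO.get(text.upper())
    if fao_code is not None:
        # Known acronym: automatic transformation to the FAO code.
        return {code_field: fao_code, descr_field: ""}
    # Unknown value: keep the original text in the description column.
    return {code_field: "", descr_field: text}
```

For a mandatory field such as INSTCODE, the fallback branch would not be enough; as the slide notes, those values had to be completed by hand.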

9
Conversion to EURISCO format
  • main conversion issues
  • requirements taxonomic fields
  • no standardization concerning classification
    system
  • authority's name was moved to appropriate fields

10
Conversion to EURISCO format
  • main conversion issues
  • country codes had to be on extended ISO 3166-1
    list
  • NICODE was copied from INSTCODE
  • incorrect ORIGCTY was corrected via COLLSITE,
    where feasible
  • correction of standard errors, examples
  • ROM -> ROU (Romania)
  • JAP -> JPN (Japan)
  • GER -> DEU (Germany)
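
A minimal sketch of the standard-error correction, using only the three substitutions named on this slide. The function name is an assumption, and a real correction table would presumably be longer.

```python
# Corrections taken from the slide; further entries would be assumptions.
COUNTRY_FIXES = {"ROM": "ROU", "JAP": "JPN", "GER": "DEU"}


def fix_country(code):
    """Normalise case and whitespace, then apply the known corrections."""
    cleaned = (code or "").strip().upper()
    return COUNTRY_FIXES.get(cleaned, cleaned)
```

Codes that are not in the table pass through unchanged, so this step does not by itself guarantee that a value is on the extended ISO 3166-1 list.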

11
Conversion to EURISCO format
  • main conversion issues
  • dates were transformed into YYYYMMDD format
  • very many formats appeared, even within one
    database
  • examples from one database
  • YY
  • YYYY
  • YYMM
  • YYMMDD
  • DDMMYY
  • DD.MM.YY
  • DD.MM.YYYY
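
The layouts listed above can be normalised as in the sketch below. Several of them cannot be told apart from the value alone (YYMMDD and DDMMYY are both six digits), so the sketch assumes the layout is supplied by whoever inspected the column. Two further assumptions: two-digit years are placed in a caller-supplied century (default 1900), and a missing month or day is written as hyphens, following the MCPD convention. The function and table names are hypothetical.

```python
import re

# Layouts listed on the slide, as regular expressions with named groups.
PATTERNS = {
    "YY": r"(?P<y>\d{2})",
    "YYYY": r"(?P<y>\d{4})",
    "YYMM": r"(?P<y>\d{2})(?P<m>\d{2})",
    "YYMMDD": r"(?P<y>\d{2})(?P<m>\d{2})(?P<d>\d{2})",
    "DDMMYY": r"(?P<d>\d{2})(?P<m>\d{2})(?P<y>\d{2})",
    "DD.MM.YY": r"(?P<d>\d{2})\.(?P<m>\d{2})\.(?P<y>\d{2})",
    "DD.MM.YYYY": r"(?P<d>\d{2})\.(?P<m>\d{2})\.(?P<y>\d{4})",
}


def to_yyyymmdd(raw, layout, century=1900):
    """Convert one date string in a known layout to YYYYMMDD."""
    match = re.fullmatch(PATTERNS[layout], raw.strip())
    if match is None:
        raise ValueError(f"{raw!r} does not match layout {layout}")
    parts = match.groupdict()
    year = parts["y"]
    if len(year) == 2:
        # Assumption: the caller knows which century the column refers to.
        year = str(century + int(year))
    # Missing month or day becomes hyphens (assumed MCPD convention).
    month = parts.get("m") or "--"
    day = parts.get("d") or "--"
    return f"{year}{month}{day}"
```

The sketch does not check that month and day are in range; with formats mixed inside one column, such a check would be a useful guard against picking the wrong layout.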

12
Conversion to EURISCO format
  • main conversion issues
  • number- and name-fields were not changed
  • for example, if ACCNAME contained local (5993
    times) this was untouched
  • low integrity of accession numbers

13
Conversion to EURISCO format
  • main conversion issues
  • collection site was often compiled from other
    fields
  • example: COLLSITE = PROVENCE + STATE

14
Conversion to EURISCO format
  • main conversion issues
  • longitude and latitude appeared in many formats
  • in case of doubt about minutes or seconds they
    were replaced by hyphens
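
A sketch of the hyphen rule, assuming the MCPD layout of two-digit degrees for latitude and three-digit degrees for longitude, each followed by two-digit minutes, two-digit seconds and a hemisphere letter. The function name and its arguments are hypothetical; parsing the many source formats is not shown.

```python
def mcpd_coordinate(degrees, minutes=None, seconds=None, hemisphere="N"):
    """Format a coordinate, writing unknown minutes or seconds as hyphens."""
    # Latitude (N/S) uses 2-digit degrees, longitude (E/W) uses 3 digits.
    width = 2 if hemisphere in ("N", "S") else 3
    deg = f"{degrees:0{width}d}"
    mm = "--" if minutes is None else f"{minutes:02d}"
    # Seconds are only meaningful when the minutes are known.
    if minutes is None or seconds is None:
        ss = "--"
    else:
        ss = f"{seconds:02d}"
    return f"{deg}{mm}{ss}{hemisphere}"
```

Passing `None` for a doubtful part keeps the uncertainty visible in the output instead of silently writing zeros.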

15
Conversion to EURISCO format
  • main conversion issues
  • coded information (SAMPSTAT, COLLSRC, STORAGE)
    not always followed MCPD v2
  • MCPD v1 was transformed
  • other codes were transformed on basis of
    additional information

16
Analysis of consolidated data
  • Documentation Quality Index (DQI)
  • tool for analyzing quality/completeness of data
    sets
  • higher index -> more complete data
  • each type (SAMPSTAT) of accession can reach same
    maximum
  • based on the occurrence of values in fields, an
    index is calculated for each record
  • quality of the value is not considered

17
Analysis of consolidated data
  • Documentation Quality Index (DQI)
  • examples of calculation
  • if SPECIES has a value: 25 points
  • if SPAUTHOR has a value: 2 points, provided that
    SPECIES has a value
  • if LONGITUDE has a value:
  • 7 points, provided that SAMPSTAT starts with 1
    (wild) or 2 (weedy) and LATITUDE has a value
  • 5 points, provided that SAMPSTAT starts with 3
    (landrace) and LATITUDE has a value
  • 2 points if SAMPSTAT has no value
  • otherwise 0 points
  • valuation of fields is subjective !!
  • how important is knowing the SPAUTHOR relative to
    ALTITUDE ?
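
The example rules on this slide can be written out as a small scoring function. It covers only the three descriptors mentioned here (SPECIES, SPAUTHOR, LONGITUDE), so it yields a partial score, not the full 150-point index; the function names and the dictionary record layout are assumptions made for the sketch.

```python
def has_value(record, field):
    """True if the field is present and not blank; quality is not judged."""
    value = record.get(field)
    return value is not None and str(value).strip() != ""


def partial_dqi(record):
    """Score one record using only the example rules from the slide."""
    score = 0
    if has_value(record, "SPECIES"):
        score += 25
        # SPAUTHOR only counts when SPECIES is present.
        if has_value(record, "SPAUTHOR"):
            score += 2
    if has_value(record, "LONGITUDE"):
        sampstat = str(record.get("SAMPSTAT") or "").strip()
        has_lat = has_value(record, "LATITUDE")
        if sampstat[:1] in ("1", "2") and has_lat:
            score += 7   # wild or weedy material
        elif sampstat.startswith("3") and has_lat:
            score += 5   # landrace
        elif not sampstat:
            score += 2   # sample status unknown
        # any other case: 0 points
    return score
```

Because only the presence of a value is tested, a record with a wrong but non-empty SPECIES scores the same as a correct one, which is the limitation noted on the previous slide.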

18
Analysis of consolidated data
  • total number of accessions 507732
  • number of GENUS/SPECIES combinations 1360
  • 420 of these only have 1 accession
  • 941 of these have fewer than 10 accessions
  • largest CCDBs barley (129507 acc.) and wheat
    (108229 acc.)
  • smallest CCDB Trisetum (87 acc.)
  • CCDB with most genera Umbellifer (47 genera)
  • oldest accession a white cabbage from 1726
  • highest origin barley from Tibet collected at
    4650 m

19
Analysis of consolidated data
  • distribution over sample status
  • nearly half unknown
  • weedy occurs only 445 times (0%)

20
Analysis of consolidated data
  • occurrence of values per descriptor
  • 10 of 34 descriptors > 50%, 11 descriptors < 10%

21
Analysis of consolidated data
  • DQI in CCDBs
  • theoretical maximum is 150
  • possible for all types of material
  • theoretical minimum is 0
  • only mandatory fields have a value
  • varied from 0 to 129, average 57.0

22
Analysis of consolidated data
  • DQI per holding country
  • varied from 37 to 84 (countries with > 1000 acc.)

23
Analysis of consolidated data
  • DQI per crop
  • varied from 28 to 73 (crops with > 1000 acc.)

24
Comparison NI <-> CCDB
  • comparison National Inventory <-> CCDB slice
  • number of accessions
  • average DQI for common accessions
  • results
  • expected size of EURISCO
  • effect of conversion via CCDBs into EURISCO
    format versus conversion into NI

25
Comparison NI <-> CCDB
  • number of accessions
  • CCDB contains between 38% and 74% of accessions
    in current NIs

26
Comparison NI <-> CCDB
  • DQI of common accessions
  • DQI decreases in case of HUN (14000 acc) and NLD
    (13000 acc) - not in the case of DEU (53000 acc)

27
Conclusions
  • many CCDBs have low data quality
  • various formats in one column, irrelevant values,
    etc.
  • conversion of data generally reduces quality
  • more accessions in NIs than in CCDBs
  • expected number of accessions in EURISCO is
    1,000,000 ± 300,000

28
Acknowledgements
  • CCDB and NI curators
  • Vanessa Fens and Theo van Hintum (CGN)
  • Attila Simon (Institute for Agrobotany,
    Tápiószele, Hungary)