
1
Untangling Names
  • Lessons learned (so far) from the linking of
  • IPNI and TROPICOS
  • Julius Welby
  • RBG Kew
  • j.welby_at_kew.org

2
Background
3
TROPICOS IPNI
4
Why match?
5
Why is this difficult?
6
Variation
  • Calophyllum kiong K.Schum. & Lauterb. Fl.
    Deutsch. Sudsee, 450.
  • Calophyllum kiong Lauterb. & K.Schum. Die Flora
    der Deutschen Schutzgebiete in der Sudsee 1900
7
Duplication
  • Asteraceae Hieracium sylvaticum Lapeyr. -- Hist.
    Pl. Pyrenées 472. 1813
  • Asteraceae Hieracium sylvaticum Balb. ex Froel.
    -- Prodr. (DC.) vii. 232.
  • Asteraceae Hieracium sylvaticum Wahlenb. -- Fl.
    Lapp. 197 Fl. Suec. 515.
  • Asteraceae Hieracium sylvaticum Porta -- in Atti
    Accad. Agiati, ix. 1891 (1892) 45.
  • Asteraceae Hieracium sylvaticum Wallr. -- Sched.
    Crit. 422. (IK)
  • Asteraceae Hieracium sylvaticum Jan ex Froel. --
    Prodr. (DC.) vii. 221.
  • Asteraceae Hieracium sylvaticum Gouan -- Illustr.
    56 Retz. Obs. i. 27.
  • Asteraceae Hieracium sylvaticum Salisb. -- Prod.
    181.
  • Asteraceae Hieracium sylvaticum Lam. -- Fl. Fr.
    ii. 96.
  • Asteraceae Hieracium sylvaticum Bertol. -- Fl.
    Ital. viii. 485.

8
Duplication
  • Poa annua L. -- Sp. Pl. 68. 1753 (GCI)
  • Poa annua L. -- Species Plantarum 2 1753 (APNI)
  • Poa annua L. -- Sp. Pl. 68. (IK)

9
Duplication
  • Calophyllum microphyllum Scheff. in Tijdschr.
    Nederl. Ind. xxxii. (1871) 406. (IK)
  • Calophyllum microphyllum Planch. & Triana in Ann.
    Sc. Nat. Ser. IV. xv. (1861) 282. (IK)
  • Calophyllum microphyllum T.Anders. Fl. Brit. Ind.
    (J. D. Hooker). i. 272. (IK)

10
Matching
11
Fields
  • 1 Calophyllum | Calophyllum
  • 2 kiong | kiong
  • 3 K.Schum. & Lauterb. | Lauterb. & K.Schum.
  • Fl. Deutsch. Sudsee | Die Flora der Deutschen
  • 450. | 1900

12
Lesson 1
Speed matters
13
Speed matters
2,500 by 2,000 by 4 fields = 20,000,000
comparisons = about 5.5 hours at 1 ms per comparison
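
The estimate on this slide can be checked with a few lines of Python. This is only a back-of-envelope sketch: the variable names are illustrative, and the 1 ms cost per comparison is the figure quoted on the slide.

```python
# Cost of comparing every record in one set with every record in the
# other, field by field (all figures taken from the slide).
records_a = 2_500
records_b = 2_000
fields = 4
ms_per_comparison = 1

comparisons = records_a * records_b * fields
hours = comparisons * ms_per_comparison / 1000 / 3600

print(comparisons)      # 20000000
print(round(hours, 2))  # 5.56
```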
14
Be lazy
15
Be lazy
  • Do as little as possible
  • Do easy things if possible
  • Do hard things only if necessary
  • Only expend effort when it's worth it

16
Be lazy
  • Do as little as possible
  • Specify fields as must match
  • If a must match field fails
  • Mark the match as failed
  • Stop comparing fields
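
A minimal sketch of the must-match idea, assuming records are plain dictionaries. The function name `compare`, the field names, and the return shape are hypothetical and are not taken from the framework itself.

```python
def compare(a, b, fields, must_match):
    """Compare two records field by field.

    Returns (ok, mismatches). As soon as a must-match field differs,
    the match is marked as failed and no further fields are compared.
    """
    mismatches = []
    for field in fields:
        if a[field] == b[field]:
            continue
        mismatches.append(field)
        if field in must_match:
            return False, mismatches  # fail and stop comparing
    return True, mismatches


fields = ["genus", "species", "authors"]
must_match = {"genus", "species"}

kiong = {"genus": "Calophyllum", "species": "kiong",
         "authors": "K.Schum. & Lauterb."}
kiong_variant = {"genus": "Calophyllum", "species": "kiong",
                 "authors": "Lauterb. & K.Schum."}
microphyllum = {"genus": "Calophyllum", "species": "microphyllum",
                "authors": "Scheff."}

# Species is a must-match field, so authors is never compared here.
failed = compare(kiong, microphyllum, fields, must_match)
# Authors differ, but authors is not a must-match field.
passed = compare(kiong, kiong_variant, fields, must_match)
```

In this sketch a mismatch on a field that is not must-match is only recorded, which leaves room to treat it as a near miss later.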

17
Parameterised matching
species infragenus infraspecies authors rank
18
How lazy?
19
Optimising
  • The order of field matching is important
  • Choose suitable fields to match first
  • Aim to fail matches early
  • Significant speed-up
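
The effect of field order can be illustrated by counting comparisons on a toy dataset. This is a sketch under simple assumptions: every field must match, comparison stops at the first failing field, and the helper `count_comparisons` is hypothetical.

```python
def count_comparisons(left, right, field_order):
    """Count field comparisons when matching stops at the first failure."""
    count = 0
    for a in left:
        for b in right:
            for field in field_order:
                count += 1
                if a[field] != b[field]:
                    break  # fail early, skip the remaining fields
    return count


left = [
    {"genus": "Calophyllum", "species": "kiong"},
    {"genus": "Calophyllum", "species": "microphyllum"},
]
right = [
    {"genus": "Calophyllum", "species": "kiong"},
    {"genus": "Calophyllum", "species": "microphyllum"},
    {"genus": "Hieracium", "species": "sylvaticum"},
]

genus_first = count_comparisons(left, right, ["genus", "species"])
species_first = count_comparisons(left, right, ["species", "genus"])
print(genus_first, species_first)  # 10 8
```

Because most of these records share a genus, testing the species epithet first rejects non-matching pairs sooner. Which field discriminates best depends on the data being matched.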

20
Also, for speed
  • Do as little as possible
  • Do escaping or standardisation once
  • Done on import for each dataset
  • Keep field matching functions clean
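
One way to read this slide is sketched below: each dataset is standardised once when it is imported, so the function called inside the comparison loop stays trivial. The clean-up rule shown (trim, collapse whitespace, lower-case) is an assumption for illustration only; the slides do not say which rules the framework applies.

```python
def standardise(value):
    """Hypothetical clean-up: trim, collapse whitespace, lower-case."""
    return " ".join(value.split()).lower()


def import_dataset(rows, fields):
    """Standardise every field once, at import time."""
    return [{f: standardise(row[f]) for f in fields} for row in rows]


def fields_equal(a, b, field):
    """Called millions of times, so it does no cleaning of its own."""
    return a[field] == b[field]


raw = [{"genus": " Calophyllum ", "species": "Kiong"}]
clean = import_dataset(raw, ["genus", "species"])
print(clean)  # [{'genus': 'calophyllum', 'species': 'kiong'}]
```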

21
More speed optimisation
  • Do easy things if possible
  • Define cascading tests
  • Do easy tests first, if practical
  • Length comparisons
  • Composition comparisons
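
A cascade of this kind might look like the sketch below: a length check, then a character-composition check, and only then a more expensive similarity measure. The thresholds and the use of `difflib.SequenceMatcher` for the final step are assumptions; the slides do not name the fuzzy comparison the framework uses.

```python
from collections import Counter
from difflib import SequenceMatcher


def similar(a, b, max_diff=2, min_ratio=0.8):
    """Cascading test: cheap rejections first, expensive check last."""
    if a == b:
        return True
    # Cheap test 1: lengths too different to be a near match.
    if abs(len(a) - len(b)) > max_diff:
        return False
    # Cheap test 2: too many characters not shared between the strings.
    ca, cb = Counter(a), Counter(b)
    unshared = sum(((ca - cb) + (cb - ca)).values())
    if unshared > 2 * max_diff:
        return False
    # Expensive test, reached only by pairs that survive the cheap ones.
    return SequenceMatcher(None, a, b).ratio() >= min_ratio


print(similar("kiong", "microphyllum"))      # False (length test)
print(similar("sylvaticum", "sylvaticus"))   # True
```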

22
Speed Lessons
  • Speed matters
  • Minimise comparisons made
  • Must match parameters
  • Match fields in an efficient order
  • Do data cleaning once, up front
  • Look for ways to fail matches cheaply

23
Accuracy
24
Accuracy
False -
OK
False +
25
Strict match
F-
OK
26
Fuzzy match
F+
OK
27
Doughnut of uncertainty
28
Lesson 2
Look at near misses
29
Near misses are checkable
30
One approach
  • Currently, to get best results
  • Tend towards strictness
  • Handle false negatives

31
One approach
  • Currently, best results from
  • Tend towards strictness
  • Handle false negatives
  • Failures on rightmost fields can be written to
    a report
  • Checked and fed back in as escapes
  • Rerun
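
The report-and-rerun loop can be sketched as follows. This is an interpretation, not the framework's code: a "near miss" is taken to be a pair that fails only on the last field in the matching order, and an "escape" is taken to be a checked pair of values that should be treated as equal on the next run. The names `classify` and `escapes` are hypothetical.

```python
def classify(a, b, fields, escapes=frozenset()):
    """Return 'match', 'near miss' or 'fail' for a pair of records."""
    last = len(fields) - 1
    for position, field in enumerate(fields):
        left, right = a[field], b[field]
        if left == right or (field, left, right) in escapes:
            continue
        # A failure on the rightmost field goes to the report.
        return "near miss" if position == last else "fail"
    return "match"


fields = ["genus", "species", "authors"]
record_a = {"genus": "Calophyllum", "species": "kiong",
            "authors": "K.Schum. & Lauterb."}
record_b = {"genus": "Calophyllum", "species": "kiong",
            "authors": "Lauterb. & K.Schum."}

first_run = classify(record_a, record_b, fields)

# After a person has checked the report, the pair is fed back in.
escapes = {("authors", record_a["authors"], record_b["authors"])}
second_run = classify(record_a, record_b, fields, escapes)
print(first_run, second_run)  # near miss match
```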

32
Lesson 3
Remove predictable variation
33
Predictable variation
  • Gendered endings
  • Common alternatives
  • Endings
  • ii, i
  • iae, ae
  • Dataset specific quirks
  • &amp;, &
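
Removing this kind of variation before matching could be sketched as below. The two ending rules follow the slide; reading the "amp" item as a stray HTML entity `&amp;` standing in for `&` is an assumption, and the function names and example epithets are illustrative rather than taken from the datasets.

```python
import re


def normalise_epithet(epithet):
    """Collapse the common ending alternatives listed on the slide."""
    epithet = re.sub(r"iae$", "ae", epithet)
    epithet = re.sub(r"ii$", "i", epithet)
    return epithet


def normalise_authors(authors):
    """Dataset-specific quirk (assumed): HTML-escaped ampersands."""
    return authors.replace("&amp;", "&")


print(normalise_epithet("wilsonii"))   # wilsoni
print(normalise_epithet("wilsoniae"))  # wilsonae
print(normalise_authors("K.Schum. &amp; Lauterb."))  # K.Schum. & Lauterb.
```

Applied once on import, rules like these let a strict comparison treat both spellings as the same value.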

34
The framework
  • Python
  • Psyco
  • Modular
  • Extensible
  • In progress
  • More details will be available on the TDWG
    website
  • Source code availability

35
The framework
  • Some results (HTML)

36
Thanks to
  • Bob Magill
  • Sally Hinchcliffe
  • The Moore Foundation
  • Contact
  • j.welby_at_kew.org
  • or after Jan 2007 julius.welby_at_gmail.com