Taxonomic Name Recognition (TNR) in Biodiversity Heritage Library

Description:

American Museum of Natural History (New York, NY) The Field ... Bulletin Medical Library Association 81:184-94. 8/28/09. TNR in BHL. 27. Questions? 8/28/09 ...



1
Taxonomic Name Recognition (TNR) in Biodiversity
Heritage Library
  • Qin Wei, Chris Freeland, P. Bryan Heidorn
  • Missouri Botanical Garden

2
Co-author
  • Chris Freeland
  • Director of Biodiversity Heritage Library
  • IT division manager of Missouri Botanical Garden
  • chris.freeland_at_mobot.org

3
Biodiversity Heritage Library (BHL)
  • Ten major natural history museum libraries,
    botanical libraries, and research institutions
    have joined to form the BHL. The group is
    developing a strategy and operational plan to
    digitize the published literature of biodiversity
    held in their respective collections. This
    literature will be available through a global
    biodiversity commons.
  • More information about BHL can be found at
    http://www.biodiversitylibrary.org

4
Participating institutions
  • American Museum of Natural History (New York, NY)
  • The Field Museum (Chicago, IL)
  • Harvard University Botany Libraries (Cambridge,
    MA)
  • Harvard University (Cambridge, MA)
  • Marine Biological Laboratory / Woods Hole
    Oceanographic Institution (Woods Hole, MA)
  • Missouri Botanical Garden (St. Louis, MO)
  • Natural History Museum (London, UK)
  • The New York Botanical Garden (New York, NY)
  • Royal Botanic Gardens, Kew (Richmond, UK)
  • Smithsonian Institution Libraries (Washington,
    DC)

5
Open Access
  • The BHL Project strives to establish a major corpus
    of digitized publications on the Web drawn from
    the historical biodiversity literature. This
    material will be available for open access and
    responsible use as a part of a global
    Biodiversity Commons. We will work with the
    global taxonomic community, rights holders, and
    other interested parties to ensure that this
    legacy literature is available to all.

6
(No Transcript)
7
TNR in BHL
  • A significant aspect of BHL is the incorporation
    of algorithmic taxonomic intelligence provided by
    uBio.org.
  • As materials are scanned, the image files are
    processed through ABBYY FineReader or PrimeOCR to
    create text derivatives. Those text files are
    then submitted to uBio's TaxonFinder web service
    to identify strings in the text that match the
    characteristics of scientific names.
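
The name-finding step can be pictured with a minimal sketch. This is not TaxonFinder and not the uBio web service: the pattern and the function name `candidate_names` are assumptions made for illustration. It only flags strings in OCR text whose shape resembles a Latin binomial (a capitalized genus followed by a lowercase epithet).

```python
import re

# Hypothetical shape-based pattern, not uBio's actual rules:
# a capitalized word (genus) followed by a lowercase word (epithet).
BINOMIAL = re.compile(r"\b([A-Z][a-z]{2,})\s+([a-z]{3,})\b")

def candidate_names(ocr_text):
    """Return 'Genus epithet' strings that look like scientific names."""
    return [f"{genus} {epithet}" for genus, epithet in BINOMIAL.findall(ocr_text)]

text = "Specimens of Quercus alba were found near Homo sapiens camps."
print(candidate_names(text))
```

A shape-only rule like this over-matches ordinary phrases such as "The oak", which is one reason the candidates still have to be checked against validated names.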

8
Two TNR algorithms
  • TaxonFinder, developed by uBio, uses statistical
    models built from the validated organism names
    in NameBank.
  • These models aim to describe the structure and
    frequency of common character sequences of
    organism names, so that TaxonFinder can infer
    whether an unknown word has a structure similar
    to that of a known organism name.
  • Online only
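
The idea of scoring an unknown word by its character sequences can be sketched as follows. This is an illustrative stand-in, assuming a simple character-trigram overlap score; the actual TaxonFinder models are not described on the slide, and the names `trigrams`, `train` and `name_likeness` are hypothetical.

```python
from collections import Counter

def trigrams(word):
    """Character trigrams of a word, with ^ and $ marking its start and end."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def train(known_names):
    """Count the trigrams seen in a list of validated names."""
    counts = Counter()
    for name in known_names:
        counts.update(trigrams(name))
    return counts

def name_likeness(word, counts):
    """Fraction of the word's trigrams that also occur in the known names."""
    grams = trigrams(word)
    if not grams:
        return 0.0
    return sum(1 for g in grams if g in counts) / len(grams)

# Toy stand-in for a validated name list such as NameBank.
model = train(["quercus", "lupus", "rattus", "ficus"])
print(name_likeness("locus", model), name_likeness("the", model))
```

In this toy setup "locus" shares the endings "cus" and "us$" with the known names and scores higher than "the", which shares none.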

9
Two TNR algorithms
  • FAT, short for Finds All Taxonomic names, was
    developed to automatically extract all the
    taxonomic names from the biological literature.
  • It then uses the parts already classified to build
    lexica and statistics (dictionary lookup), which
    are used to classify the rest of the text
    (Sautter et al.).
  • Offline usage and customized dictionaries
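
The "classify some parts, build lexica, classify the rest" idea can be sketched in two passes. This is a loose illustration, not the FAT implementation from Sautter et al.; the function name, the seed genus list and the two-pass rule are all assumptions.

```python
def classify_with_bootstrapped_lexicon(candidates, seed_genera):
    """Accept (genus, epithet) pairs in two passes.

    Pass 1 accepts pairs whose genus is in the seed dictionary and
    collects their epithets into a learned lexicon.
    Pass 2 accepts remaining pairs whose epithet is in that lexicon.
    """
    genera = set(seed_genera)
    learned_epithets = set()
    accepted, remaining = [], []

    for genus, epithet in candidates:
        if genus in genera:
            accepted.append((genus, epithet))
            learned_epithets.add(epithet)
        else:
            remaining.append((genus, epithet))

    for genus, epithet in remaining:
        if epithet in learned_epithets:
            accepted.append((genus, epithet))

    return accepted

pairs = [("Quercus", "alba"), ("Salix", "alba"), ("The", "oak")]
print(classify_with_bootstrapped_lexicon(pairs, seed_genera={"Quercus"}))
```

Here "Salix alba" is accepted only because "alba" was learned from "Quercus alba", while "The oak" is rejected.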

10
(No Transcript)
11
Digitalization Process
12
Digitalization Process
13
Sample Characteristics
14
Evaluation Measures
  • Precision is the proportion of matched strings
    that are valid names. In our case, precision
    measures the algorithm's ability to exclude
    non-valid names from the result.
  • Recall is the proportion of valid names in the
    whole database that were returned as true
    positives. It measures the ability to find all
    valid names in the database.
  • In this evaluation, we also use a single measure,
    the F-score, which is the harmonic mean of
    recall (R) and precision (P)
  • F-score = 2 × (Precision × Recall) / (Precision + Recall)
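
As a worked example of these measures, the sketch below computes precision, recall and F-score from a set of returned strings and a set of valid names. The function name and the sample data are made up for illustration and are not the evaluation data from this study.

```python
def precision_recall_f(returned, valid):
    """Compute precision, recall and F-score for a set of returned names."""
    returned, valid = set(returned), set(valid)
    true_positives = len(returned & valid)
    precision = true_positives / len(returned) if returned else 0.0
    recall = true_positives / len(valid) if valid else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f_score

# Illustrative data only: 4 strings returned, 3 of them valid, 6 valid names in total.
returned = {"Quercus alba", "Homo sapiens", "Salix alba", "The oak"}
valid = {"Quercus alba", "Homo sapiens", "Salix alba",
         "Felis catus", "Canis lupus", "Pinus strobus"}
print(precision_recall_f(returned, valid))
```

With 3 true positives out of 4 returned strings and 6 valid names, precision is 0.75, recall is 0.5, and the F-score is 0.6.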

15
Sample Language Distribution
16
Ground Truth
  • Total: 3003 valid names

17
OCR Overall Performance
18
Error Breakdown
19
OCR: Does language matter?
20
Top OCR error patterns
21
NameBank
  • For TaxonFinder, the NameBank incompleteness error
    rate is 6%

22
Error Analysis
  • Exact Match

23
Overall Performance
24
Conclusion
  • Our results indicate that TaxonFinder is slightly
    better than FAT. But even TaxonFinder reached an
    F-score of only 38.47, which is low compared with
    other named entity recognition results. For
    instance, the best system entering the Message
    Understanding Conferences (MUC) scored an F-score
    of 93.39, while human annotators scored 97.60
    and 96.95.
  • There is substantial room to improve the
    algorithms and get better results.

25
Future Work
  • Intelligent (AI-based) retrieval is the trend
  • How could we achieve it?
  • Experiments on machine learning methods
  • Using other external sources, e.g. ontologies
  • Automatic OCR correction
  • Fuzzy matching algorithms in IR
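
One way to picture fuzzy matching for OCR correction is the sketch below, which maps a damaged string to the closest entry in a list of known names using Python's standard `difflib`. It is a possible direction, not something evaluated in this study; the function name `correct_name`, the 0.85 cutoff and the name list are assumptions.

```python
import difflib

def correct_name(ocr_string, known_names, cutoff=0.85):
    """Return the closest known name, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(ocr_string, known_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Toy lexicon standing in for a validated name list.
known = ["Quercus alba", "Quercus rubra", "Salix alba"]
print(correct_name("Quercus a1ba", known))       # OCR read the letter l as the digit 1
print(correct_name("Table of contents", known))  # not similar to any known name
```

The cutoff trades recall against precision: a lower value recovers more damaged names but also maps more non-names onto valid ones.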

26
References
  • [1] S. Rice, J. Kanai, and T. Nartker. An
    evaluation of OCR accuracy. In UNLV Information
    Science Research Institute Annual Report, pages
    9-20, 1993.
  • [2] Koning, D., N. Sarkar, and T. Moritz. 2005.
    TaxonGrab: Extracting Taxonomic Names from Text.
    Biodiversity Informatics 2: 79-82.
  • [3] Sautter, G., K. Bohm, and D. Agosti. 2006. A
    combining approach to find all taxon names (FAT)
    in legacy biosystematics literature.
    Biodiversity Informatics 3: 41-53.
    (http://jbi.nhm.ku.edu/index.php/jbi/article/view/34/16)
  • [4] McCray, A. T., A. R. Aronson, A. C. Browne,
    T. C. Rindflesch, A. Razi, and S. Srinivasan.
    1993. UMLS knowledge for biomedical language
    processing. Bulletin of the Medical Library
    Association 81:184-94.

27
  • Questions?

28
  • Thanks!