Automatic Metadata Extraction Darwin Core From Museum Specimen Labels

1
Automatic Metadata Extraction (Darwin Core) From
Museum Specimen Labels
<co> Curtis, </co><hdlc> North American Pl </hdlc><cnl> No.</cnl><cn> 503</cn>
<gn> Polygala</gn><sp> ambigua,</sp><sa> Nutt.,</sa><val> var.</val>
<hb> Coral soil,</hb><lc> Cudjoe Key, South Florida. </lc><col> Legit</col><co> A. H. Curtiss.</co><dt>February</dt>
  • Qin Wei, P. Bryan Heidorn
  • University of Illinois at Urbana-Champaign, USA
  • Email: qinwei2_at_illinois.edu

2
About me
  • PhD student in Information Science at UIUC
  • Research area: information retrieval, natural
    language processing, text mining
  • Dissertation topic: taxonomic name recognition
    from full text
  • Expected graduation: Fall 2009
  • Master's degree from UIUC
  • Bachelor's degree in Information Management from
    Peking University

3
Co-author
  • Dr. P. Bryan Heidorn
  • Professor at the Graduate School of Library and
    Information Science, UIUC
  • pheidorn_at_illinois.edu

4
The problem
  • More than 1 billion natural history specimens
  • Collected over 250 years, in many languages
  • No publishing standards
  • Near-infinite classes
  • 6 min/label × 1B labels = 100M hours
  • Saving 1 min per label = 16.7 million hours
  • At $10/hr, that is $167,000,000
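
The figures on this slide can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; reading the wage as US dollars is an assumption, and the constant and function names are illustrative.

```python
# Back-of-the-envelope cost of transcribing specimen labels by hand.
LABELS = 1_000_000_000     # "more than 1 billion" specimens, from the slide
MINUTES_PER_LABEL = 6      # manual transcription time per label, from the slide
WAGE_PER_HOUR = 10         # from the slide; assumed to be US dollars


def total_hours(minutes_per_label: float, labels: int = LABELS) -> float:
    """Convert a per-label time in minutes into total person-hours."""
    return minutes_per_label * labels / 60


full_effort = total_hours(MINUTES_PER_LABEL)  # every label at 6 minutes each
saved_hours = total_hours(1)                  # shaving 1 minute off every label
saved_cost = saved_hours * WAGE_PER_HOUR

print(f"{full_effort:,.0f} hours to transcribe everything")
print(f"{saved_hours:,.0f} hours saved per minute shaved off")
print(f"{saved_cost:,.0f} saved at {WAGE_PER_HOUR}/hr")
```

Rounded, the three values match the slide: 100 million hours, about 16.7 million hours, and about 167 million in wages.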

5
Why care?
  • Historic distribution of species
  • Ecological niche modeling (invasiveness, crop
    hardiness, pest potential)
  • Projections of the impact of climate change

6
The Project
  • Yale University Herbarium
  • New York Botanical Garden
  • University of Illinois
  • Funded by the National Science Foundation

7
Metadata
  • Data about data
  • Author: James Smith
  • Date: August 14, 2008
  • Compare to
  • Author: James Smith
  • Date: August 14, 2008
  • The importance of metadata
  • Dublin Core in library science
  • Darwin Core in TDWG (more information:
    http://wiki.tdwg.org/twiki/bin/view/DarwinCore/WebHome)

8
Some Elements from Darwin Core
  • Class
  • Order
  • Family
  • Genus
  • Species
  • Subspecies
  • ScientificNameAuthor
  • IdentifiedBy
  • YearIdentified
  • MonthIdentified
  • DayIdentified
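
To make the element list concrete, here is a minimal sketch of how one label could be held as a record keyed by these element names. The element names come from the slide; the dictionary layout and the `make_record` helper are assumptions for illustration, not part of Darwin Core or of the project's software.

```python
# Element names taken from the slide; the record layout is an assumption.
DARWIN_CORE_ELEMENTS = [
    "Class", "Order", "Family", "Genus", "Species", "Subspecies",
    "ScientificNameAuthor", "IdentifiedBy",
    "YearIdentified", "MonthIdentified", "DayIdentified",
]


def make_record(**values):
    """Return a record holding every listed element, defaulting to None."""
    unknown = set(values) - set(DARWIN_CORE_ELEMENTS)
    if unknown:
        raise KeyError(f"not in the slide's element list: {sorted(unknown)}")
    return {name: values.get(name) for name in DARWIN_CORE_ELEMENTS}


# Values read off the sample label shown on the title slide.
record = make_record(
    Genus="Polygala",
    Species="ambigua",
    ScientificNameAuthor="Nutt.",
)
print(record["Genus"], record["Species"], record["ScientificNameAuthor"])
```

Elements the label does not state, such as Family, simply stay `None`.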

9
Why Machine Learning?
  • Successfully adopted in related areas such as
    information retrieval and named entity
    recognition
  • Many tools are already available (e.g.
    Weka, D2K)
  • More adaptable to data variability, such as
    spelling variability
  • Can be user-driven rather than programmer-driven:
    each user may fine-tune their own models

10
Supervised Machine Learning
  • The method operates under supervision by being
    provided with the actual outcome for each of the
    training examples (Witten and Frank, 2005)
  • In other words, the learner acquires knowledge
    from the examples and then uses that knowledge to
    classify new examples.
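
As a rough illustration of learning from labeled examples, the sketch below trains a tiny Naive Bayes classifier on tokens from the sample label and then classifies an unseen token. The tag names (`gn`, `sp`, `cn`, ...) follow the markup shown in this deck; the three features and the add-one smoothing are assumptions chosen for brevity, not the authors' feature set or implementation.

```python
import math
from collections import Counter, defaultdict


def features(token):
    """Crude, hypothetical features describing one token."""
    return [
        "lower=" + token.lower().strip(".,"),
        "cap=" + str(token[:1].isupper()),
        "digit=" + str(any(c.isdigit() for c in token)),
    ]


class NaiveBayes:
    def fit(self, examples):
        """Count labels and per-label features from (token, label) pairs."""
        self.label_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for token, label in examples:
            self.label_counts[label] += 1
            for f in features(token):
                self.feature_counts[label][f] += 1
                self.vocab.add(f)
        return self

    def predict(self, token):
        """Return the label with the highest smoothed log-probability."""
        total = sum(self.label_counts.values())

        def score(label):
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)  # add-one smoothing
            s = math.log(self.label_counts[label] / total)
            for f in features(token):
                s += math.log((counts[f] + 1) / denom)
            return s

        return max(self.label_counts, key=score)


# Training examples: the "actual outcome" is the tag paired with each token.
training = [
    ("Polygala", "gn"), ("ambigua,", "sp"), ("Nutt.,", "sa"),
    ("503", "cn"), ("No.", "cnl"), ("Curtiss.", "co"),
    ("Legit", "col"), ("var.", "val"),
]
model = NaiveBayes().fit(training)
print(model.predict("504"))  # an unseen collection number
```

With only eight training tokens the model is a toy, but it shows the two steps named above: knowledge is gathered from labeled examples, then applied to a new one.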

11
Work flow
12
Sample records
13
Sample OCR Output
  • Yale University Herbarium
  • r-""" r-n-------
  • YU.001300
  • Curtisb, North American Pl
  • Co.nr r-n
  • ANTS,
  • No. 503 "
  • Polygala ambigna, Nntt., var.
  • Coral soil, Cudjoe Key, South Florida.
  • Legit A. H. Curtiss.

14
Example Training Record
  • <?xml version="1.0" encoding="UTF-8"?>
  • <?oxygen RNGSchema="http://www3.isrl.uiuc.edu/TeleNature/Herbis/semanticrelax.rng" type="xml"?>
  • <labeldata>
  • <bt>Yale University Herbarium
  • </bt><ns> r-""" r-n------</ns><bc> YU.001300
  • </bc><co cc="Curtiss"> Curtisb, </co><hdlc cc="North American Plants"> North American Pl
  • </hdlc><ns>Co.nr r-n
  • ANTS,</ns>
  • <cnl> No.</cnl><cn> 503</cn><ns> "</ns>
  • <gn> Polygala</gn><sp> ambigna,</sp><sa> Nntt.,</sa><val> var.</val>
  • <hb> Coral soil,</hb><lc> Cudjoe Key, South Florida.
  • </lc><col> Legit</col><co> A. H. Curtiss.</co>
  • </labeldata>
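
Assuming the training record is ordinary XML (with the angle brackets, quotes, and equals signs in place), it can be turned into labeled segments with Python's standard library. The sketch below uses an abridged copy of the record with the noise (`ns`) elements left out; treating the `cc` attribute as a human-supplied correction of the OCR text is an assumption, and `labeled_segments` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the training record; noise (<ns>) elements omitted.
RECORD = """<labeldata>
<bt>Yale University Herbarium
</bt><bc> YU.001300
</bc><co cc="Curtiss"> Curtisb, </co><hdlc cc="North American Plants"> North American Pl
</hdlc><cnl> No.</cnl><cn> 503</cn>
<gn> Polygala</gn><sp> ambigna,</sp><sa> Nntt.,</sa><val> var.</val>
<hb> Coral soil,</hb><lc> Cudjoe Key, South Florida.
</lc><col> Legit</col><co> A. H. Curtiss.</co>
</labeldata>"""


def labeled_segments(xml_text):
    """Yield (tag, ocr_text, corrected_text) for each child element."""
    root = ET.fromstring(xml_text)
    for element in root:
        ocr_text = (element.text or "").strip()
        # Assumption: `cc` holds the corrected form; fall back to the OCR text.
        yield element.tag, ocr_text, element.get("cc", ocr_text)


segments = list(labeled_segments(RECORD))
for tag, ocr_text, corrected in segments:
    print(f"{tag:5} {ocr_text!r} -> {corrected!r}")
```

Each tuple pairs a label class with the raw OCR string, which is the form a supervised learner needs as a training example.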

15
Supervised Learning Framework
[Diagram. Training phase: Unclassified Labels, Human Editing, Gold Classified Labels, Machine Learner. Application phase: Unclassified Labels, Segmentation, Segmented Text, Machine Classifier, Silver Classified Labels.]

16
Experimental Data
  • 295 marked-up records
  • Printed labels, no handwriting
  • 74 label states
  • Naive Bayes classifier vs. Hidden Markov Model
  • 5-fold cross-validation
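
For readers unfamiliar with the evaluation setup, the sketch below shows one way 295 records could be split for 5-fold cross-validation. The shuffling, the seed, and the `k_fold_indices` helper are assumptions; the slides do not say how the folds were actually drawn.

```python
import random


def k_fold_indices(n_records, k=5, seed=0):
    """Yield (train, test) index lists; every record is tested exactly once."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)  # arbitrary seed, for repeatability
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(295, k=5))
for train, test in splits:
    print(len(train), "training records,", len(test), "test records")
```

With 295 records and 5 folds, each round trains on 236 records and tests on the remaining 59.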

17
Performance of NB and HMM
18
Future Work
  • Community Learning Models
  • Label records might be processed in different
    orders to maximize learning and minimize the error
    rate
  • OCR correction might be improved using
    context-dependent information. Context-dependent
    correction means correcting a word after its
    class is known. For example, the word
    "Ourtiss" should be corrected to "Curtiss". If
    the system has already identified "Ourtiss" as a
    collector, it can look the word up in the smaller
    collector dictionary instead of a much larger
    general dictionary.
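
The context-dependent correction idea can be sketched with Python's standard `difflib` module: once a word's class is known, only that class's dictionary is searched. The dictionaries, the 0.8 similarity cutoff, and the `correct` helper are all hypothetical; this is future work in the talk, so no implementation is described there.

```python
import difflib

# Hypothetical dictionaries; a real system would load far larger lists.
CLASS_DICTIONARIES = {
    "co": ["Curtiss", "Nuttall", "Chapman"],  # collectors
    "gn": ["Polygala", "Quercus"],            # genera
}
GENERAL_DICTIONARY = ["Curtains", "Courtesy", "Curtiss", "Outside", "Polygala"]


def correct(word, label=None, cutoff=0.8):
    """Return the closest dictionary entry, or the word unchanged."""
    # Use the class-specific dictionary when the class is known.
    candidates = CLASS_DICTIONARIES.get(label, GENERAL_DICTIONARY)
    matches = difflib.get_close_matches(word, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else word


print(correct("Ourtiss", label="co"))
```

Searching a short class-specific list keeps the candidate set small, which is the benefit the slide points to; words with no sufficiently close entry are returned unchanged rather than forced into a match.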

19
Community Learning Models
[Diagram. Training phase: Unclassified Labels, Human Editing, Gold Classified Labels, Machine Learner, Trained Model, Evaluation. Application phase: Unclassified Labels, Segmentation, Segmented Text, Machine Classifier, Silver Classified Labels.]

21
References
  • Witten, I. H., and Frank, E. (2005). Data mining:
    Practical machine learning tools and techniques
    (2nd ed.). Boston, MA: Morgan Kaufmann Publishers.
  • Cui, H., and Heidorn, P. B. (2007). The
    reusability of induced knowledge for the
    automatic semantic markup of taxonomic
    descriptions. Journal of the American Society for
    Information Science and Technology, 58(1),
    133-149.
  • Blum, A. L., and Langley, P. (1997). Selection of
    relevant features and examples in machine
    learning. Artificial Intelligence, 97, 245-271.
  • Melville, P., and Mooney, R. J. (2003).
    Constructing diverse classifier ensembles using
    artificial training examples. In Proceedings of
    IJCAI-2003, 505-510.