Title: Automatic Metadata Extraction (Darwin Core) From Museum Specimen Labels
1Automatic Metadata Extraction (Darwin Core) From
Museum Specimen Labels
<co> Curtis, </co><hdlc> North American Pl </hdlc><cnl> No.</cnl><cn> 503</cn>
<gn> Polygala</gn><sp> ambigua,</sp><sa> Nutt.,</sa><val> var.</val>
<hb> Coral soil,</hb><lc> Cudjoe Key, South Florida. </lc><col> Legit</col>
<co> A. H. Curtiss.</co><dt>February</dt>
- Qin Wei, P. Bryan Heidorn
- University of Illinois at Urbana-Champaign, USA
- Email: qinwei2_at_illinois.edu
2About me
- PhD student in Information Science at UIUC
- Research areas: information retrieval, natural language processing, text mining
- Dissertation topic: Taxonomic Name Recognition from full text
- Expected graduation in Fall 2009
- Master's degree at UIUC
- Bachelor's degree in Information Management at Peking University
3Co-author
- Dr. P. Bryan Heidorn
- Professor at the Graduate School of Library and Information Science, UIUC
- pheidorn_at_illinois.edu
4The problem
- More than 1 billion natural history specimens
- Collected over 250 years, in many languages
- No publishing standards
- Near-infinite classes
- 6 min / label x 1B labels = 100M hours
- Saving 1 min per label = 16.7 million hours
- At $10/hr = $167,000,000
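The figures on this slide can be checked with a few lines of arithmetic. This is only a sketch: the variable names are illustrative, and the hourly rate is assumed to be in US dollars, since the currency symbol does not survive in the transcript.

```python
# Back-of-the-envelope check of the transcription cost estimate.
LABELS = 1_000_000_000       # roughly 1 billion specimen labels
MINUTES_PER_LABEL = 6        # manual transcription time per label
HOURLY_RATE = 10             # assumed to be US dollars per hour

# Total effort to transcribe every label by hand.
total_hours = LABELS * MINUTES_PER_LABEL / 60

# Effort saved if automation removes one minute from each label.
hours_saved = LABELS * 1 / 60

# Value of that saving at the assumed hourly rate.
money_saved = hours_saved * HOURLY_RATE

print(f"total effort: {total_hours / 1e6:.0f} million hours")
print(f"saved by 1 min/label: {hours_saved / 1e6:.1f} million hours")
print(f"value of saving: {money_saved / 1e6:.0f} million")
```

Even a one-minute saving per label is worth about a sixth of the total manual effort, which is the motivation for automating the extraction.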
5Why care
- Historic distribution of species
- Ecological niche modeling (invasiveness, crop hardiness, pest potential)
- Projections of the impact of climate change
6The Project
- Yale University Herbarium
- New York Botanical Garden
- University of Illinois
- Funded by National Science Foundation
7Metadata
- Data about data
- Author: James Smith
- Date: August 14, 2008
- Compare to:
- Author: James Smith
- Date: August 14, 2008
- The importance of metadata
- Dublin Core in library science
- Darwin Core in TDWG (more information can be found at http://wiki.tdwg.org/twiki/bin/view/DarwinCore/WebHome)
8Some Elements from Darwin Core
- Class
- Order
- Family
- "Genus"
- "Species"
- "Subspecies"
- "ScientificNameAuthor"
- "IdentifiedBy"
- "YearIdentified"
- "MonthIdentified"
- "DayIdentified"
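As a rough illustration of how these elements might be held in a program, the sketch below stores one record as a dictionary keyed by the element names listed above. The helper `make_record` is hypothetical, and the values come from the sample label on the title slide; the slides do not prescribe any particular data structure.

```python
# Element names exactly as listed on the slide.
ELEMENTS = [
    "Class", "Order", "Family", "Genus", "Species", "Subspecies",
    "ScientificNameAuthor", "IdentifiedBy",
    "YearIdentified", "MonthIdentified", "DayIdentified",
]

def make_record(**values):
    """Build a record with every listed element; unset elements are None."""
    unknown = set(values) - set(ELEMENTS)
    if unknown:
        raise KeyError(f"not a listed Darwin Core element: {sorted(unknown)}")
    return {name: values.get(name) for name in ELEMENTS}

# Values taken from the sample label (Polygala ambigua Nutt.).
record = make_record(
    Genus="Polygala",
    Species="ambigua",
    ScientificNameAuthor="Nutt.",
)
print(record)
```

Rejecting unknown keys keeps extracted fields aligned with the element list, so a misspelled element name fails early instead of silently creating a new column.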
9Why Machine Learning?
- Successfully adopted in related areas such as information retrieval and named entity recognition
- Many tools are already available (e.g. Weka, D2K)
- More adaptable to data variability, such as spelling variability
- Can be user driven rather than programmer driven: each user may fine-tune their own models
10Supervised Machine Learning
- The method operates under supervision by being provided with the actual outcome for each of the training examples (Witten, 2005)
- In other words, the learner acquires knowledge from the examples and then uses that knowledge to classify new examples.
11Work flow
12Sample records
13Sample OCR Output
- Yale University Herbarium
- r-""" r-n-------
- YU.001300
- Curtisb, North American Pl
- Co.nr r-n
- ANTS,
- No. 503 "
- Polygala ambigna, Nntt., var.
- Coral soil, Cudjoe Key, South Florida.
- Legit A. H. Curtiss.
14Example Training Record
<?xml version="1.0" encoding="UTF-8"?>
<?oxygen RNGSchema="http://www3.isrl.uiuc.edu/TeleNature/Herbis/semanticrelax.rng" type="xml"?>
<labeldata>
<bt>Yale University Herbarium
</bt><ns> r-""" r-n------</ns><bc> YU.001300
</bc><co cc="Curtiss"> Curtisb, </co><hdlc cc="North American Plants"> North American Pl
</hdlc><ns>Co.nr r-n
ANTS,</ns>
<cnl> No.</cnl><cn> 503</cn><ns> "</ns>
<gn> Polygala</gn><sp> ambigna,</sp><sa> Nntt.,</sa><val> var.</val>
<hb> Coral soil,</hb><lc> Cudjoe Key, South Florida.
</lc><col> Legit</col><co> A. H. Curtiss.</co>
</labeldata>
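A training record of this kind can be flattened into labeled tokens before it is handed to a learner. The sketch below uses Python's standard `xml.etree.ElementTree` on a shortened copy of the record (noise elements omitted). The function name `labeled_tokens` is hypothetical, and reading the `cc` attribute as a human-supplied corrected form of the OCR text is an assumption based on the values shown.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example record; OCR errors are kept as they appear.
RECORD = """<labeldata>
<bt>Yale University Herbarium
</bt><bc> YU.001300
</bc><co cc="Curtiss"> Curtisb, </co><hdlc cc="North American Plants"> North American Pl
</hdlc><cnl> No.</cnl><cn> 503</cn>
<gn> Polygala</gn><sp> ambigna,</sp><sa> Nntt.,</sa><val> var.</val>
<hb> Coral soil,</hb><lc> Cudjoe Key, South Florida.
</lc><col> Legit</col><co> A. H. Curtiss.</co>
</labeldata>"""

def labeled_tokens(xml_text):
    """Return (token, tag, corrected) triples, one per whitespace-separated token.

    `corrected` is the element's cc attribute (assumed to be a correction) or None.
    """
    root = ET.fromstring(xml_text)
    triples = []
    for element in root:
        corrected = element.get("cc")
        for token in (element.text or "").split():
            triples.append((token, element.tag, corrected))
    return triples

for token, tag, corrected in labeled_tokens(RECORD):
    print(f"{tag:5s} {token:12s} {corrected or ''}")
```

Each token inherits the tag of its enclosing element, which gives the (example, outcome) pairs that a supervised learner needs.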
15Supervised Learning Framework
[Diagram: supervised learning framework]
- Training Phase: Unclassified Labels, Gold Classified Labels, Machine Learner, Human Editing
- Application Phase: Unclassified Labels, Machine Classifier, Segmented Text, Silver Classified Labels, Segmentation
16Experimental Data
- 295 marked-up records
- Printed labels, no handwriting
- 74 label states
- Naive Bayes classifier vs. Hidden Markov Model
- 5-fold cross-validation
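To make the setup concrete, here is a minimal from-scratch sketch of a Naive Bayes token classifier evaluated with k-fold cross-validation. It is a stand-in for illustration, not the implementation used in the experiments: the class `NaiveBayes`, the `features` function, the two features it emits, and the toy examples are all assumptions.

```python
import math
from collections import Counter, defaultdict

def features(token):
    """Hypothetical features: the lowercased token and a coarse shape."""
    if token.strip(".,").isdigit():
        shape = "digits"
    elif token[:1].isupper():
        shape = "cap"
    else:
        shape = "lower"
    return ["w=" + token.lower(), "shape=" + shape]

class NaiveBayes:
    def fit(self, examples):
        """examples: list of (token, label) pairs."""
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocabulary = set()
        for token, label in examples:
            self.class_counts[label] += 1
            for f in features(token):
                self.feature_counts[label][f] += 1
                self.vocabulary.add(f)
        return self

    def predict(self, token):
        """Return the label with the highest log posterior (add-one smoothing)."""
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)
            denom = sum(self.feature_counts[label].values()) + len(self.vocabulary)
            for f in features(token):
                score += math.log((self.feature_counts[label][f] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

def k_fold_accuracy(examples, k=5):
    """Mean accuracy over k folds; fold i holds out every k-th example from i."""
    scores = []
    for i in range(k):
        held_out = examples[i::k]
        training = [e for j, e in enumerate(examples) if j % k != i]
        model = NaiveBayes().fit(training)
        correct = sum(model.predict(tok) == lab for tok, lab in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / k

# Toy data for illustration only, not the 295 marked-up records.
TOY = [("Polygala", "gn"), ("ambigua,", "sp"), ("503", "cn"),
       ("Quercus", "gn"), ("alba", "sp"), ("17", "cn")]
model = NaiveBayes().fit(TOY)
print(model.predict("1204"), model.predict("Carex"))
print(k_fold_accuracy(TOY, k=3))
```

Unlike a Hidden Markov Model, this classifier labels each token independently and ignores the labels of neighboring tokens, which is the main modeling difference between the two methods compared here.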
17Performances of NB and HMM
18Future Work
- Community Learning Models
- Label records might be processed in different orders to maximize learning and minimize error rate
- OCR correction might be improved using context-dependent information. Context-dependent correction means making the correction after the word's class is known. For example, the word "Ourtiss" should be corrected to "Curtiss". If the system has already identified "Ourtiss" as a collector, we can use the smaller collector dictionary instead of a much larger general dictionary to do the correction.
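The context-dependent correction idea can be sketched with Python's standard `difflib` module. The per-class dictionaries, the class codes `co` (collector) and `sp` (species), the function name `correct`, and the similarity cutoff are all assumptions made for illustration.

```python
import difflib

# Hypothetical class-specific dictionaries; real ones would be curated lists.
DICTIONARIES = {
    "co": ["Curtiss", "Chapman", "Small"],   # collectors
    "sp": ["ambigua", "alba", "rubra"],      # species epithets
}

def correct(word, label, cutoff=0.8):
    """Return the closest entry in the dictionary for `label`, or `word` unchanged.

    Only the small dictionary for the predicted class is searched, so the
    classifier's output narrows the set of candidate corrections.
    """
    candidates = DICTIONARIES.get(label, [])
    matches = difflib.get_close_matches(word, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else word

print(correct("Ourtiss", "co"))
print(correct("ambigna", "sp"))
print(correct("Ourtiss", "unknown"))
```

Searching a short class-specific list reduces the chance of matching an unrelated word that happens to be close in spelling, which is the risk with a large general dictionary.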
19Community Learning Models
[Diagram: community learning models]
- Training Phase: Unclassified Labels, Evaluation, Gold Classified Labels, Machine Learner, Trained Model, Human Editing
- Application Phase: Unclassified Labels, Machine Classifier, Segmented Text, Silver Classified Labels, Segmentation
21References
- Witten, I. H., and Frank, E. (2005). Data mining: Practical machine learning tools and techniques (2nd ed.). Boston, MA: Morgan Kaufmann Publishers.
- Cui, H., and Heidorn, P. B. (2007). The reusability of induced knowledge for the automatic semantic markup of taxonomic descriptions. Journal of the American Society for Information Science and Technology, 58(1), 133-149.
- Blum, A. L., and Langley, P. (1997). Selection of relevant features and examples in machine learning. Artificial Intelligence, 97, 245-271.
- Melville, P., and Mooney, R. J. (2003). Constructing diverse classifier ensembles using artificial training examples. In Proceedings of IJCAI-2003, 505-510.