Title: BibPro: A Citation Parser System
1BibPro: A Citation Parser System
2Introduction
- Integrating bibliographical information
- Metadata
- author, title of the article, title of the book
containing the paper, journal name, month and
year of publication, etc.
- Citation string
- Thousands of variations
- More than 2,000 formats in EndNote
- Citation Parsing Problem
- automatically recognize individual fields from a
given citation string
- A template citation parser
3Goal
4Machine Learning
- Conditional Random Field
- F. Peng, A. McCallum. Accurate information
extraction from research papers using conditional
random fields. Proceedings of Human Language
Technology Conference and North American Chapter
of the Association for Computational Linguistics
(HLT-NAACL), 2004, 329-336.
- Support Vector Machine
- Hui Han, C. L. Giles, E. Manavoglu, Hongyuan
Zha, Zhenyue Zhang, E. A. Fox. Automatic document
metadata extraction using support vector
machines. Proceedings of the 3rd ACM/IEEE-CS
Joint Conference on Digital Libraries, 2003,
37-48.
- Hidden Markov Model
- K. Seymore, A. McCallum, R. Rosenfeld. Learning
hidden Markov model structure for information
extraction. AAAI-99 Workshop on Machine Learning
for Information Extraction, 1999, 37-42.
- A. Takasu. Bibliographic attribute extraction
from erroneous references based on a statistical
model. Proceedings of the 3rd ACM/IEEE-CS Joint
Conference on Digital Libraries, 2003, 49-60.
- Erik Hetzner. A simple method for citation
metadata extraction using hidden Markov models.
JCDL 2008.
5Knowledge Base
- A tree-like knowledge representation scheme that
organizes the knowledge of reference concepts in
a hierarchical fashion
- Min-Yuh Day et al. Reference metadata extraction
using a hierarchical knowledge representation
framework. Decision Support Systems, 2006.
- A knowledge base automatically constructed from
an existing set of sample metadata records of a
given area
- E. Cortez, A. S. da Silva, M. A. Goncalves, F.
Mesquita, and E. S. de Moura. FLUX-CiM: flexible
unsupervised extraction of citation metadata. In
Proc. of the 7th ACM/IEEE Joint Conf. on Digital
Libraries, pages 215-224, Vancouver, BC, Canada,
2007. ACM.
6Template Base
- Keep citation style as a template
- ParaCite: http://paracite.eprints.org/
- I-Ane Huang, Jan-Ming Ho, Hung-Yu Kao, and
Shian-Hua Lin. Extracting citation metadata from
online publication lists using BLAST. In PAKDD,
2004, 539-548.
- Chien-Chih Chen, Kai-Hsiang Yang, Hung-Yu Kao,
Jan-Ming Ho. BibPro: A Citation Parser Based on
Sequence Alignment Techniques. AINA Workshops
2008, pp. 1175-1180.
7Key Idea
- Encode a citation string into a template for
BLAST
- Keep citation style information in a protein
sequence
- Utilizing bioinformatics sequence tools
- BLAST
- Using Domain Knowledge
- Reserved word
- Knowledge database (optional)
- Blocking rule (common sense knowledge)
8Question
- How many symbols can be used in a protein
sequence?
- 23 symbols used in BLAST
- How many fields should be extracted from a
citation?
- choose the most commonly used fields
- Which punctuation marks are treated as partition
marks?
- based on domain knowledge
- How do we transform a citation into a protein
sequence and retain its structural features?
- define an encoding table
9Encoding Procedure
10Encoding Table
Classification                     Symbol  Representation
Contents          Extracted Field  A       Author
Contents          Extracted Field  T       Title
Contents          Extracted Field  L       Venue (Journal, BookTitle, Technical Report)
Contents          Extracted Field  V       Volume
Contents          Extracted Field  W       Issue
Contents          Extracted Field  P       Page
Contents          Extracted Field  Y       Date (Year, Month)
Contents          Unknown Field    X       Single Unknown
Contents          Unknown Field    B       Continuous Unknown
Contents          Unknown Field    N       Numeral
Punctuation Mark  Partition Mark   R       ,
Punctuation Mark  Partition Mark   D       .
Punctuation Mark  Partition Mark   G       "
Punctuation Mark  Partition Mark   E       '
Punctuation Mark  Partition Mark   C
Punctuation Mark  Partition Mark   Z
Punctuation Mark  Partition Mark   H       -
Punctuation Mark  Brackets         I       ( <
Punctuation Mark  Brackets         K       ) >
Punctuation Mark  Misc             Q       / _ ! @ \ ? ?
Others                             F       Editor
Others                             S       Institution
Others                             M       Publisher
11Encoding Knowledge
- A: AUTHOR
- Name abbreviation
- T: TITLE
- Length of Blocking
- L: VENUE
- booktitle (conference): Proceedings Proc Workshop
Conf Conference Symposium Sympos Symp
International Intern Annual Annu
- journaltitle: Transactions Trans Journal
- techtitle (thesis): Tech rep Rpt TR Master Masters
Ph PhD Thesis thesis Dissertation dissertation
- V: VOLUME
- volume: Volume volume Vol vol Vo vo
- issue: Number number Nr nr No no NO Nos
- P: PAGE
- page: pp page pages PP Page Pages pg PG
12Encoding Knowledge
- Y: DATE
- month: January February March April May June July
August September October November December Jan
Feb Mar Apr Jun Jul Aug Sep Oct Nov Dec Sept
- year: 1900-2010
- F: EDITOR
- editor: eds Eds editors Editors editor Eds ED Ed
ed edited
- S: INSTITUTION
- institution: University Univ Department Dept
Corporation
- M: PUBLISHER
- publisher: Press Pub Publishers Inc Publications
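The reserved-word lists above can be read as a lookup from token to symbol. The following is a minimal sketch under stated assumptions: `RESERVED`, `WORD_TO_SYMBOL`, and `encode_word` are hypothetical names, only a subset of the listed words is included, matching is case-insensitive for brevity (the slides list case variants explicitly), and issue words are mapped to W as in the encoding table. How BibPro detects author name abbreviations (A) and titles (T) is not covered here.

```python
# Hypothetical lookup built from a subset of the reserved words on the slides.
RESERVED = {
    "L": "proceedings proc workshop conf conference symposium "
         "transactions trans journal thesis dissertation",
    "V": "volume vol vo",
    "W": "number nr no nos",
    "P": "page pages pp pg",
    "Y": "january february march april may june july august september "
         "october november december jan feb mar apr jun jul aug sep sept "
         "oct nov dec",
    "F": "editor editors eds ed edited",
    "S": "institution university univ department dept corporation",
    "M": "publisher publishers press pub inc publications",
}

# Invert the table: reserved word -> one-letter symbol.
WORD_TO_SYMBOL = {
    word: symbol for symbol, words in RESERVED.items() for word in words.split()
}


def encode_word(token: str) -> str:
    """Map one non-punctuation token to a symbol.

    Reserved words get their field symbol, a number in 1900-2010 is
    treated as a year (Y), any other number is a numeral (N), and
    everything else is a single unknown (X).
    """
    lowered = token.lower()
    if lowered in WORD_TO_SYMBOL:
        return WORD_TO_SYMBOL[lowered]
    if lowered.isascii() and lowered.isdigit():
        return "Y" if 1900 <= int(lowered) <= 2010 else "N"
    return "X"
```

For example, `encode_word("vol")` gives `V`, `encode_word("1995")` gives `Y`, and a surname such as `Bianchini` falls through to `X`.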
13Tokenizing and Encoding Citation
M . Bianchini , P . Frasconi , and M . Gori , "
Learning in multilayered networks used as
autoassociators , " IEEE Transactions on Neural
Networks , vol . 6 , pp . 512 -515 , March 1995 .
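A token sequence like the one above can be produced with a simple regular expression. This is only a sketch: `tokenize` is a hypothetical helper, the untokenized citation string is reconstructed from the tokens on the slide, and the hyphen in the page range is emitted as its own token (an assumption, since the encoding table gives "-" its own symbol).

```python
import re
from typing import List


def tokenize(citation: str) -> List[str]:
    """Split a citation into word tokens and single punctuation tokens."""
    # \w+ grabs a run of letters/digits; [^\w\s] grabs one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", citation)


# Citation reconstructed from the tokenized example on the slide.
CITATION = (
    'M. Bianchini, P. Frasconi, and M. Gori, "Learning in multilayered '
    'networks used as autoassociators," IEEE Transactions on Neural '
    "Networks, vol. 6, pp. 512-515, March 1995."
)

tokens = tokenize(CITATION)
print(" ".join(tokens))
```

Each token would then be assigned one symbol from the encoding table, giving the protein-like sequence that the later steps operate on.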
14Blocking Mechanism
- After encoding the citation, we can exploit its
semi-structured nature through special patterns
in the sequence
- Use blocking rules to merge each special pattern
into a single unit
- e.g., ADXRA → A
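A blocking rule of this kind can be sketched as a string rewrite applied until nothing changes. The rule list below contains only the single ADXRA example from the slide; `RULES` and `apply_blocking` are hypothetical names, and BibPro's actual rule set is not reproduced here.

```python
from typing import List, Tuple

# (pattern, replacement) pairs; only the rule shown on the slide is included.
RULES: List[Tuple[str, str]] = [("ADXRA", "A")]


def apply_blocking(sequence: str, rules: List[Tuple[str, str]]) -> str:
    """Rewrite the encoded sequence with every rule until it stops changing."""
    changed = True
    while changed:
        changed = False
        for pattern, replacement in rules:
            rewritten = sequence.replace(pattern, replacement)
            if rewritten != sequence:
                sequence = rewritten
                changed = True
    return sequence


print(apply_blocking("ADXRADXRADXR", RULES))
```

Repeating the rewrite matters because one pass of `str.replace` only handles non-overlapping matches, so a long author list needs several passes to collapse.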
15Blocking(1/2)
16Blocking(2/2)
Index Form: ARGBRGLRVRPRYD (keep each block's
start and end positions, e.g., A: start 0,
end 11)
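Keeping the start and end position of every block can be sketched as follows. The per-token labels in `TOKEN_SYMBOLS` are a hypothetical labeling of the example citation (one symbol per token, with the page-range hyphen counted as a token), chosen so that the author block covers tokens 0 to 11 as on the slide; `to_index_form` is an illustrative helper, not BibPro's routine.

```python
from itertools import groupby
from typing import List, Tuple

Block = Tuple[str, int, int]  # (symbol, start token index, end token index)


def to_index_form(token_symbols: str) -> Tuple[str, List[Block]]:
    """Collapse runs of identical per-token symbols into positioned blocks."""
    blocks: List[Block] = []
    position = 0
    for symbol, run in groupby(token_symbols):
        length = len(list(run))
        blocks.append((symbol, position, position + length - 1))
        position += length
    index_form = "".join(symbol for symbol, _, _ in blocks)
    return index_form, blocks


# Hypothetical per-token symbols for the 42 tokens of the example citation.
TOKEN_SYMBOLS = (
    "A" * 12 + "RG" + "B" * 7 + "RG" + "L" * 5 + "R"
    + "V" * 3 + "R" + "P" * 5 + "R" + "Y" * 2 + "D"
)

index_form, blocks = to_index_form(TOKEN_SYMBOLS)
print(index_form)
print(blocks[0])
```

The stored positions are what later allow a relabeled block, such as the unknown block B, to be mapped back to the original tokens.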
17Template Database
- A record in the Template Database
- A citation item with both citation string and
metadata
- Style Form
- Index Form
- Once the template database has been constructed,
BibPro can provide the citation parsing service
on-the-fly
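A template record as described above might be represented like this. The class and field names are hypothetical, and the exact-match lookup by index form is a simplification: in BibPro, similar templates are found with BLAST rather than by equality.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TemplateRecord:
    citation: str             # raw citation string
    metadata: Dict[str, str]  # known field values for that citation
    style_form: str           # encoded with the known answer (fields labeled)
    index_form: str           # encoded without the answer (unknown blocks)


class TemplateDatabase:
    """In-memory store keyed by index form (illustrative only)."""

    def __init__(self) -> None:
        self._by_index: Dict[str, List[TemplateRecord]] = {}

    def add(self, record: TemplateRecord) -> None:
        self._by_index.setdefault(record.index_form, []).append(record)

    def candidates(self, index_form: str) -> List[TemplateRecord]:
        return list(self._by_index.get(index_form, []))


db = TemplateDatabase()
db.add(TemplateRecord(
    citation='M. Bianchini, P. Frasconi, and M. Gori, "Learning in '
             'multilayered networks used as autoassociators," IEEE '
             "Transactions on Neural Networks, vol. 6, pp. 512-515, "
             "March 1995.",
    metadata={"venue": "IEEE Transactions on Neural Networks",
              "volume": "6", "page": "512-515", "date": "March 1995"},
    style_form="ARGTRGLRVRPRYD",
    index_form="ARGBRGLRVRPRYD",
))
```

Storing both forms per record is what lets a query's index form be compared against templates whose field labels are already known.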
18Citation Style Template
- Index Form (Unknown Answer)
- Style Form (Known Answer)
19Parsing (Template Matching)
20Finding Citation Style Templates
- Using Score mechanism
- Finding similar citation style templates
- BLAST Score Matrix
- which fields exist in the query citation
- the order of partition marks represents a
citation style
- Choose by IndexForm
- Align query citation with citation style template
- Score Matrix (dynamic programming)
- Content Symbol maps to Content Symbol
- Partition Mark maps to Partition Mark
- Choose the most suitable citation style template
according to alignment between IndexForm and
StyleForm
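The selection step can be sketched as a global alignment score computed by dynamic programming, in the style of Needleman and Wunsch. The score values and gap penalty are illustrative assumptions rather than BibPro's actual score matrix, BLAST itself is not called, and `pair_score`, `align_score`, and `best_template` are hypothetical names. The one property taken from the slide is that content symbols should pair with content symbols and partition marks with partition marks.

```python
from typing import Iterable

# Punctuation symbols from the encoding table; everything else is content.
PARTITION = set("RDGECZHIKQ")


def pair_score(a: str, b: str) -> int:
    """Assumed scores: reward like-with-like, punish content vs. mark."""
    if a == b:
        return 2
    if (a in PARTITION) != (b in PARTITION):
        return -2  # content symbol paired with a punctuation symbol
    return -1 if a in PARTITION else 1  # two different marks / two contents


def align_score(query: str, template: str, gap: int = -1) -> int:
    """Global alignment score using two rolling rows of the DP table."""
    previous = [j * gap for j in range(len(template) + 1)]
    for i, q in enumerate(query, 1):
        current = [i * gap]
        for j, t in enumerate(template, 1):
            current.append(max(
                previous[j - 1] + pair_score(q, t),  # pair q with t
                previous[j] + gap,                   # gap in template
                current[j - 1] + gap,                # gap in query
            ))
        previous = current
    return previous[-1]


def best_template(query: str, templates: Iterable[str]) -> str:
    """Pick the style form whose alignment with the query scores highest."""
    return max(templates, key=lambda style: align_score(query, style))
```

With these assumed scores, the query `ARGBRGLRVRPRYD` prefers the style form `ARGTRGLRVRPRYD` over a template with a different order of partition marks, because every mark lines up and the unknown block B pairs with the content symbol T.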
21
Classification                     Symbol  Representation
Contents          Extracted Field  A       Author
Contents          Extracted Field  T       Title
Contents          Extracted Field  L       Venue (Journal, BookTitle, Technical Report)
Contents          Extracted Field  V       Volume, Issue
Contents          Extracted Field  P       Page
Contents          Extracted Field  Y       Date (Year, Month)
Contents          Unknown Field    X       Single Unknown
Contents          Unknown Field    B       Continuous Unknown
Contents          Unknown Field    N       Numeral
Punctuation Mark  Partition Mark   R       ,
Punctuation Mark  Partition Mark   D       .
Punctuation Mark  Partition Mark   G       "
Punctuation Mark  Partition Mark   E       '
Punctuation Mark  Partition Mark   C
Punctuation Mark  Partition Mark   Z
Punctuation Mark  Partition Mark   H       -
Punctuation Mark  Brackets         I       ( <
Punctuation Mark  Brackets         K       ) >
Punctuation Mark  Misc             Q       / _ ! @ \ ? ?
Others                             F       Editor
Others                             S       Institution
Others                             M       Publisher
Others                             W
22Parsing (Alignment Extraction)
(Query) Index Form: ARGBRGLRVRPRYD
(Template) Style Form: ARGTRGLRVRPRYD
ARGTRGLRVRPRYD
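The extraction step for this example can be sketched as a global alignment with traceback, after which every unknown block in the query takes the field symbol it is aligned with. The scores, the gap penalty, and the names `align` and `relabel` are assumptions made for illustration; this is not BibPro's actual procedure.

```python
from typing import List, Optional, Tuple

PARTITION = set("RDGECZHIKQ")  # punctuation symbols from the encoding table
UNKNOWN = set("XBN")           # unknown-field symbols from the encoding table


def pair_score(a: str, b: str) -> int:
    """Assumed scores: reward like-with-like, punish content vs. mark."""
    if a == b:
        return 2
    if (a in PARTITION) != (b in PARTITION):
        return -2
    return -1 if a in PARTITION else 1


def align(query: str, template: str,
          gap: int = -1) -> List[Tuple[Optional[int], Optional[int]]]:
    """Return aligned index pairs; None marks a gap on that side."""
    n, m = len(query), len(template)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(query[i - 1], template[j - 1]),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
    # Walk back from the bottom-right corner to recover one optimal path.
    pairs: List[Tuple[Optional[int], Optional[int]]] = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and score[i][j] ==
                score[i - 1][j - 1] + pair_score(query[i - 1], template[j - 1])):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    return pairs[::-1]


def relabel(index_form: str, style_form: str) -> str:
    """Give each unknown query block the field symbol it aligns with."""
    labels = list(index_form)
    for qi, tj in align(index_form, style_form):
        if qi is None or tj is None:
            continue
        if index_form[qi] in UNKNOWN and style_form[tj] not in PARTITION:
            labels[qi] = style_form[tj]
    return "".join(labels)


print(relabel("ARGBRGLRVRPRYD", "ARGTRGLRVRPRYD"))
```

In this example the unknown block B is aligned with T, so the tokens stored for that block would be reported as the title.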
23System Architecture
24Experiment
- Dataset
- INFOMAP Dataset
- A total of 160,000 citation records were
collected from digital libraries on the Web
- Citation string data was generated for each of
the six citation styles (APA, IEEE, ACM, MISQ,
JMIS, and ISR)
- Cora Dataset
- 500 records with diverse citation styles
- FLUX-CiM Dataset
- 2000 HS-domain records
- 300 CS-domain records
25Experiment
- Evaluation
- Token-Level
- A is the number of true positive tokens
- B is the number of false negative tokens
- C is the number of false positive tokens
- D is the number of true negative tokens
- Field-Level
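The token-level measures can be computed from the four counts A to D as sketched below. The formulas are the standard definitions of precision, recall, F-measure, and accuracy, assumed here because the slide's own formulas did not survive extraction; `token_metrics` is a hypothetical helper, and field-level accuracy is not covered.

```python
from typing import Dict


def _ratio(numerator: float, denominator: float) -> float:
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def token_metrics(a: int, b: int, c: int, d: int) -> Dict[str, float]:
    """a = true positives, b = false negatives,
    c = false positives, d = true negatives (token counts)."""
    precision = _ratio(a, a + c)
    recall = _ratio(a, a + b)
    f_measure = _ratio(2 * precision * recall, precision + recall)
    accuracy = _ratio(a + d, a + b + c + d)
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }


print(token_metrics(a=90, b=10, c=10, d=890))
```

Because F-measure is the harmonic mean of precision and recall, it always lies between the two values.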
26INFOMAP Dataset Result
Field    Token-Level Precision (%)  Token-Level Recall (%)  Token-Level F-Measure (%)  Field-Level Accuracy (%)
Author   99.38                      99.37                   99.38                      98.14
Title    99.58                      97.23                   98.39                      95.02
Venue    98.00                      97.85                   97.92                      96.36
Volume   96.44                      97.83                   98.90                      97.36
Issue    98.89                      95.58                   97.21                      94.04
Page     99.58                      98.93                   99.25                      98.16
Date     99.29                      99.59                   99.44                      99.16
Average  98.74                      98.06                   98.64                      96.89
27Cora Dataset Result
Field    Token-Level Precision (%)  Token-Level Recall (%)  Token-Level F-Measure (%)  Field-Level Accuracy (%)
Author   95.87                      97.75                   96.79                      90.99
Title    97.45                      95.15                   96.29                      90.46
Venue    91.77                      91.21                   91.47                      79.77
Volume   87.57                      87.94                   87.72                      85.64
Issue    79.96                      93.85                   86.22                      77.51
Page     97.15                      97.01                   97.07                      95.78
Date     96.91                      97.16                   97.03                      94.04
Average  92.38                      94.30                   93.22                      87.74
28Flux-CIM HS Domain Dataset
Field    Token-Level Precision (%)  Token-Level Recall (%)  Token-Level F-Measure (%)  Field-Level Accuracy (%)
Author   93.45                      99.74                   96.49                      93.30
Title    97.45                      95.11                   96.26                      92.09
Venue    96.21                      99.29                   97.71                      98.65
Volume   99.70                      98.64                   99.17                      98.85
Issue    n/a                        n/a                     n/a                        n/a
Page     100.00                     97.07                   98.51                      96.99
Date     98.38                      100.00                  99.18                      99.00
Average  97.53                      98.31                   97.89                      96.48
29Flux-CIM CS Domain Dataset
Field    Token-Level Precision (%)  Token-Level Recall (%)  Token-Level F-Measure (%)  Field-Level Accuracy (%)
Author   97.89                      98.26                   98.05                      96.64
Title    99.30                      98.12                   98.70                      96.32
Venue    98.40                      98.80                   98.60                      90.89
Volume   83.71                      85.77                   84.71                      82.13
Issue    89.01                      88.56                   86.90                      87.75
Page     99.47                      96.91                   98.17                      96.17
Date     91.74                      98.71                   95.07                      89.59
Average  94.22                      95.02                   94.31                      91.35
30Conclusion
- Citation parsing is a challenging problem
- Diversity in citation formats
- We present a template-based parser
- Parser System
- http://csclws.iis.sinica.edu.tw:8080/input.jsp
- Template Generator System
- http://csclws.iis.sinica.edu.tw:8080/tpin.jsp
31Reference
- F. Peng, A. McCallum. Accurate information
extraction from research papers using conditional
random fields. Proceedings of Human Language
Technology Conference and North American Chapter
of the Association for Computational Linguistics
(HLT-NAACL), 2004, 329-336.
- Hui Han, C. L. Giles, E. Manavoglu, Hongyuan
Zha, Zhenyue Zhang, E. A. Fox. Automatic document
metadata extraction using support vector
machines. Proceedings of the 3rd ACM/IEEE-CS
Joint Conference on Digital Libraries, 2003,
37-48.
- K. Seymore, A. McCallum, R. Rosenfeld. Learning
hidden Markov model structure for information
extraction. AAAI-99 Workshop on Machine Learning
for Information Extraction, 1999, 37-42.
- A. Takasu. Bibliographic attribute extraction
from erroneous references based on a statistical
model. Proceedings of the 3rd ACM/IEEE-CS Joint
Conference on Digital Libraries, 2003, 49-60.
- Min-Yuh Day et al. Reference metadata extraction
using a hierarchical knowledge representation
framework. Decision Support Systems, 2006.
- E. Cortez, A. S. da Silva, M. A. Goncalves, F.
Mesquita, and E. S. de Moura. FLUX-CiM: flexible
unsupervised extraction of citation metadata. In
Proc. of the 7th ACM/IEEE Joint Conf. on Digital
Libraries, pages 215-224, Vancouver, BC, Canada,
2007. ACM.
- I-Ane Huang, Jan-Ming Ho, Hung-Yu Kao, and
Shian-Hua Lin. Extracting citation metadata from
online publication lists using BLAST. In PAKDD,
2004, 539-548.
- Chien-Chih Chen, Kai-Hsiang Yang, Hung-Yu Kao,
Jan-Ming Ho. BibPro: A Citation Parser Based on
Sequence Alignment Techniques. AINA Workshops
2008, pp. 1175-1180.
- Erik Hetzner. A simple method for citation
metadata extraction using hidden Markov models.
JCDL 2008.
- S. F. Altschul, W. Gish, W. Miller, E. Myers, and
D. Lipman. A basic local alignment search tool.
J. Mol. Biol., 215, 1990, 403-410.
- S. B. Needleman and C. D. Wunsch. A general
method applicable to the search for similarities
in the amino acid sequence of two proteins. J.
Mol. Biol., 48, 1970, 443-453.
32Thank You !