BibPro: A Citation Parser System - PowerPoint PPT Presentation

Provided by: Roc56

1
BibPro: A Citation Parser System
2
Introduction
  • Integrating bibliographical information
  • Metadata
  • author, title of the article, title of the book
    containing the paper, journal name, month and
    year of publication, etc.
  • Citation string
  • Thousands of variations
  • More than 2,000 formats in EndNote
  • Citation Parsing Problem
  • automatically recognize individual fields from a
    given citation string
  • A template-based citation parser

3
Goal
4
Machine Learning
  • Conditional Random Field
  • F. Peng, A. McCallum. Accurate information
    extraction from research papers using conditional
    random fields. Proceedings of Human Language
    Technology Conference and North American Chapter
    of the Association for Computational Linguistics
    (HLT-NAACL), 2004, 329-336.
  • Support Vector Machine
  • Hui Han, Giles, C.L., Manavoglu, E., Hongyuan
    Zha, Zhenyue Zhang, Fox, E.A. Automatic document
    metadata extraction using support vector
    machines. Proceedings of the 3rd ACM/IEEE-CS
    Joint Conference on Digital libraries, 2003,
    37-48.
  • Hidden Markov Model
  • K. Seymore, A. McCallum, R. Rosenfeld. Learning
    hidden Markov model structure for information
    extraction. AAAI-99 Workshop on Machine Learning
    for Information Extraction, 1999, 37-42.
  • Takasu, A. Bibliographic attribute extraction
    from erroneous references based on a statistical
    model. Proceedings of the 3rd ACM/IEEE-CS Joint
    Conference on Digital libraries, 2003, 49-60.
  • Erik Hetzner. A simple method for citation
    metadata extraction using hidden Markov models.
    JCDL 2008.

5
Knowledge Base
  • A tree-like knowledge representation scheme that
    organizes the knowledge of reference concepts in
    a hierarchical fashion
  • Min-Yuh Day et al. Reference metadata extraction
    using a hierarchical knowledge representation
    framework. Decision Support Systems, 2006.
  • A knowledge base automatically constructed from
    an existing set of sample metadata records of a
    given area
  • E. Cortez, A. S. da Silva, M. A. Goncalves, F.
    Mesquita, and E. S. de Moura. FLUX-CiM: flexible
    unsupervised extraction of citation metadata. In
    Proc. of the 7th ACM/IEEE Joint Conf. on Digital
    Libraries, pages 215-224, Vancouver, BC, Canada,
    2007. ACM.

6
Template Base
  • Keep citation style as a template
  • ParaCite: http://paracite.eprints.org/
  • I-Ane Huang, Jan-Ming Ho, Hung-Yu Kao, and
    Shian-Hua Lin. Extracting citation metadata from
    online publication lists using BLAST. In PAKDD,
    2004, 539-548.
  • Chien-Chih Chen, Kai-Hsiang Yang, Hung-Yu Kao,
    Jan-Ming Ho. BibPro: A Citation Parser Based on
    Sequence Alignment Techniques. AINA Workshops
    2008, pp. 1175-1180.

7
Key Idea
  • Encode a citation string into a template for
    BLAST
  • Encoding citation style information in a protein
    sequence
  • Utilizing bioinformatics sequence tools
  • BLAST
  • Using Domain Knowledge
  • Reserved word
  • Knowledge database (optional)
  • Blocking rule (common sense knowledge)

8
Question
  • How many symbols can be used in a protein
    sequence?
  • 23 symbols used in BLAST
  • How many fields should be extracted from a
    citation?
  • Choose the most commonly used fields
  • Which punctuation marks are treated as partition
    marks?
  • Based on domain knowledge
  • How do we transform a citation into a protein
    sequence and retain its structural features?
  • Define an encoding table

9
Encoding Procedure
10
Encoding Table
Classification                     Symbol  Representation
Contents, Extracted Field          A       Author
Contents, Extracted Field          T       Title
Contents, Extracted Field          L       Venue (Journal, BookTitle, Technical Report)
Contents, Extracted Field          V       Volume
Contents, Extracted Field          W       Issue
Contents, Extracted Field          P       Page
Contents, Extracted Field          Y       Date (Year, Month)
Contents, Unknown Field            X       Single Unknown
Contents, Unknown Field            B       Continuous Unknown
Contents, Unknown Field            N       Numeral
Punctuation Mark, Partition Mark   R       ,
Punctuation Mark, Partition Mark   D       .
Punctuation Mark, Partition Mark   G       "
Punctuation Mark, Partition Mark   E       '
Punctuation Mark, Partition Mark   C
Punctuation Mark, Partition Mark   Z
Punctuation Mark, Partition Mark   H       -
Punctuation Mark, Brackets         I       ( <
Punctuation Mark, Brackets         K       ) >
Punctuation Mark, Misc             Q       / _ ! @ \ ?
Others                             F       Editor
Others                             S       Institution
Others                             M       Publisher
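As a rough illustration, the encoding table can be held as a plain lookup structure. The sketch below transcribes the rows above into a Python dictionary; the names ENCODING_TABLE and is_content are hypothetical, the grouping of F, S, and M with the content symbols is an assumption, and the two partition marks whose characters did not survive in this transcript are stored as None rather than guessed.

```python
# Symbol -> (classification, representation), transcribed from the encoding table.
# None marks a punctuation character that is missing from the transcript.
ENCODING_TABLE = {
    "A": ("extracted field", "Author"),
    "T": ("extracted field", "Title"),
    "L": ("extracted field", "Venue"),
    "V": ("extracted field", "Volume"),
    "W": ("extracted field", "Issue"),
    "P": ("extracted field", "Page"),
    "Y": ("extracted field", "Date"),
    "X": ("unknown field", "Single Unknown"),
    "B": ("unknown field", "Continuous Unknown"),
    "N": ("unknown field", "Numeral"),
    "R": ("partition mark", ","),
    "D": ("partition mark", "."),
    "G": ("partition mark", '"'),
    "E": ("partition mark", "'"),
    "C": ("partition mark", None),
    "Z": ("partition mark", None),
    "H": ("partition mark", "-"),
    "I": ("bracket", "( <"),
    "K": ("bracket", ") >"),
    "Q": ("misc", "/ _ ! @ \\ ?"),
    "F": ("other", "Editor"),
    "S": ("other", "Institution"),
    "M": ("other", "Publisher"),
}

# Assumption: "other" symbols (editor, institution, publisher) behave like content.
CONTENT_KINDS = {"extracted field", "unknown field", "other"}


def is_content(symbol: str) -> bool:
    """Return True for word-like symbols, False for punctuation symbols."""
    kind, _ = ENCODING_TABLE[symbol]
    return kind in CONTENT_KINDS


if __name__ == "__main__":
    # The table uses exactly the 23 symbols mentioned on the Question slide.
    print(len(ENCODING_TABLE), is_content("T"), is_content("R"))
```

The table uses 23 distinct letters, which matches the 23-symbol budget stated on the Question slide.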
11
Encoding Knowledge
  • A: AUTHOR
  • Name abbreviation
  • T: TITLE
  • Length of blocking
  • L: VENUE
  • booktitle (conference): Proceedings, Proc, Workshop,
    Conf, Conference, Symposium, Sympos, Symp,
    International, Intern, Annual, Annu
  • journaltitle: Transactions, Trans, Journal
  • techtitle (thesis): Tech, rep, Rpt, TR, Master, Masters,
    Ph, PhD, Thesis, thesis, Dissertation, dissertation
  • V: VOLUME
  • volume: Volume, volume, Vol, vol, Vo, vo
  • issue: Number, number, Nr, nr, No, no, NO, Nos
  • P: PAGE
  • page: pp, page, pages, PP, Page, Pages, pg, PG

12
Encoding Knowledge
  • Y: DATE
  • month: January, February, March, April, May, June,
    July, August, September, October, November,
    December, Jan, Feb, Mar, Apr, Jun, Jul, Aug, Sep,
    Oct, Nov, Dec, Sept
  • year: 1900-2010
  • F: EDITOR
  • editor: eds, Eds, editors, Editors, editor, ED, Ed,
    ed, edited
  • S: INSTITUTION
  • institution: University, Univ, Department, Dept,
    Corporation
  • M: PUBLISHER
  • publisher: Press, Pub, Publishers, Inc, Publications

13
Tokenizing and Encoding Citation
M . Bianchini , P . Frasconi , and M . Gori , "
Learning in multilayered networks used as
autoassociators , " IEEE Transactions on Neural
Networks , vol . 6 , pp . 512 - 515 , March 1995 .
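The tokenizing and encoding step for this citation might look roughly like the sketch below. It is an assumption-laden illustration, not BibPro's actual code: the function names, the regular expression used for tokenizing, and the small RESERVED subset of the encoding knowledge are all hypothetical. Words that match no reserved word become X, numerals become N, and years in the 1900-2010 range become Y.

```python
import re

# Partition marks from the encoding table (only marks present in the transcript).
PUNCTUATION = {",": "R", ".": "D", '"': "G", "'": "E", "-": "H"}

# A small, hypothetical subset of the reserved words from the encoding knowledge.
RESERVED = {
    "Proceedings": "L", "Proc": "L", "Transactions": "L", "Journal": "L",
    "Volume": "V", "volume": "V", "Vol": "V", "vol": "V",
    "pp": "P", "page": "P", "pages": "P",
}
MONTHS = ("January February March April May June July August "
          "September October November December").split()
RESERVED.update({month: "Y" for month in MONTHS})


def tokenize(citation: str) -> list[str]:
    """Split into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", citation)


def encode_token(token: str) -> str:
    """Map one token to a one-letter symbol."""
    if token in PUNCTUATION:
        return PUNCTUATION[token]
    if token in RESERVED:
        return RESERVED[token]
    if re.fullmatch(r"19\d\d|200\d|2010", token):
        return "Y"  # year inside the 1900-2010 range from the slides
    if token.isdigit():
        return "N"  # numeral
    if not token[0].isalnum():
        return "Q"  # any other punctuation is treated as miscellaneous
    return "X"      # single unknown word


def encode(citation: str) -> str:
    return "".join(encode_token(t) for t in tokenize(citation))


if __name__ == "__main__":
    citation = ('M. Bianchini, P. Frasconi, and M. Gori, "Learning in '
                'multilayered networks used as autoassociators," IEEE '
                'Transactions on Neural Networks, vol. 6, pp. 512-515, March 1995.')
    print(encode(citation))
    # XDXRXDXRXXDXRGXXXXXXXRGXLXXXRVDNRPDNHNRYYD
```

Each of the 42 tokens maps to exactly one letter, so positions in the encoded string can later be traced back to tokens.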
14
Blocking Mechanism
  • After encoding the citation, we can exploit the
    semi-structured nature of citations through
    special patterns in the sequence
  • Use blocking rules to merge a special pattern into
    a single unit
  • e.g., ADXRA → A
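One way to picture blocking is as an ordered list of pattern rewrites over the encoded string. The sketch below is only an illustration: apart from the ADXRA example taken from the slide, every rule is a hypothetical regular expression written for this example, and the real BibPro rules are not given in the slides.

```python
import re

# Ordered (pattern, replacement symbol) pairs. Only the first rule comes from
# the slide; the others are hypothetical rules chosen for this illustration.
BLOCKING_RULES = [
    (r"ADXRA", "A"),                   # example from the slide
    (r"X*XDX+(?:RX*XDX+)*", "A"),      # "initial . surname" names joined by commas
    (r"X*L[XL]*", "L"),                # words around a venue reserved word
    (r"VD?N", "V"),                    # "vol . 6"
    (r"PD?N(?:HN)?", "P"),             # "pp . 512 - 515"
    (r"Y+", "Y"),                      # "March 1995"
    (r"X{2,}", "B"),                   # remaining runs of unknown words
]


def apply_blocking(encoded: str) -> str:
    """Apply each rule in order, merging every match into a single symbol."""
    for pattern, symbol in BLOCKING_RULES:
        encoded = re.sub(pattern, symbol, encoded)
    return encoded


if __name__ == "__main__":
    print(apply_blocking("ADXRA"))
    print(apply_blocking("XDXRXDXRXXDXRGXXXXXXXRGXLXXXRVDNRPDNHNRYYD"))
```

Rule order matters in this sketch: the run-of-unknowns rule comes last so that it only absorbs words no more specific rule has claimed.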

15
Blocking(1/2)
16
Blocking(2/2)
Index Form → ARGBRGLRVRPRYD (keep the start and end
positions of each blocked area, e.g., A: start 0,
end 11)
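Keeping the start and end position of each blocked area can be sketched as follows. This is a minimal illustration under assumptions: the function name block_with_spans and the two rules are hypothetical, and positions are 0-based indices into the encoded token sequence.

```python
import re


def block_with_spans(encoded, rules):
    """Merge rule matches into single symbols and track which original
    token positions (start, end) each resulting symbol covers."""
    spans = [(i, i) for i in range(len(encoded))]
    for pattern, symbol in rules:
        new_chars, new_spans, pos = [], [], 0
        for match in re.finditer(pattern, encoded):
            # Copy everything before the match unchanged.
            new_chars.append(encoded[pos:match.start()])
            new_spans.extend(spans[pos:match.start()])
            # Replace the match by one symbol covering its whole range.
            new_chars.append(symbol)
            new_spans.append((spans[match.start()][0], spans[match.end() - 1][1]))
            pos = match.end()
        new_chars.append(encoded[pos:])
        new_spans.extend(spans[pos:])
        encoded, spans = "".join(new_chars), new_spans
    return encoded, spans


if __name__ == "__main__":
    # Hypothetical rules: an author block, then runs of unknown words.
    rules = [(r"X*XDX+(?:RX*XDX+)*", "A"), (r"X{2,}", "B")]
    index_form, spans = block_with_spans("XDXRXDXRXXDXRGXXXXXXXRG", rules)
    print(index_form)   # ARGBRG
    print(spans[0])     # (0, 11): the author block covers tokens 0 through 11
```

For the author portion of the example citation, the block covers tokens 0 through 11, which agrees with the start and end values shown on the slide.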
17
Template Database
  • A record in the Template Database
  • A citation item with both citation string and
    metadata
  • Style Form
  • Index Form
  • Once the template database has been constructed,
    BibPro can provide the citation parsing service
    on-the-fly

18
Citation Style Template
  • Index Form (Unknown Answer)
  • Style Form (Known Answer)

19
Parsing (Template Matching)
20
Finding Citation Style Templates
  • Using a scoring mechanism
  • Finding similar citation style templates
  • BLAST score matrix
  • Which fields exist in the query citation
  • The order of partition marks represents a citation
    style
  • Choose by Index Form
  • Align the query citation with the citation style
    template
  • Score matrix (dynamic programming)
  • Content symbols map to content symbols
  • Partition marks map to partition marks
  • Choose the most suitable citation style template
    according to the alignment between Index Form and
    Style Form
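The dynamic-programming alignment step can be illustrated with a small global alignment in the style of Needleman and Wunsch. Everything numeric here is an assumption: the slides do not give the score matrix, so the values below are made up to express the stated idea that content symbols should map to content symbols and partition marks to partition marks. The function names are hypothetical as well.

```python
CONTENT = set("ATLVWPYXBNFSM")   # word-like symbols from the encoding table
UNKNOWN = set("XBN")             # unknown-field symbols


def sub_score(q: str, t: str) -> int:
    """Illustrative substitution score (values are assumptions)."""
    if q == t:
        return 2
    q_content, t_content = q in CONTENT, t in CONTENT
    if q_content and t_content:
        # An unknown block may stand for any field in the template.
        return 1 if (q in UNKNOWN or t in UNKNOWN) else -1
    if not q_content and not t_content:
        return -1                # two different punctuation marks
    return -3                    # content aligned with punctuation


def align_score(query: str, template: str, gap: int = -2) -> int:
    """Global alignment score computed row by row."""
    prev = [j * gap for j in range(len(template) + 1)]
    for i, q in enumerate(query, 1):
        cur = [i * gap]
        for j, t in enumerate(template, 1):
            cur.append(max(prev[j - 1] + sub_score(q, t),  # align q with t
                           prev[j] + gap,                  # gap in template
                           cur[j - 1] + gap))              # gap in query
        prev = cur
    return prev[-1]


def best_template(index_form: str, style_forms: list[str]) -> str:
    """Pick the style form with the highest alignment score."""
    return max(style_forms, key=lambda s: align_score(index_form, s))


if __name__ == "__main__":
    templates = ["ARGTRGLRVRPRYD", "ADYDTDLRVRPD"]
    print(best_template("ARGBRGLRVRPRYD", templates))
```

With these made-up scores, the query ARGBRGLRVRPRYD aligns to ARGTRGLRVRPRYD without gaps: 13 identical symbols plus one unknown-to-field pairing.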

21
Classification                     Symbol  Representation
Contents, Extracted Field          A       Author
Contents, Extracted Field          T       Title
Contents, Extracted Field          L       Venue (Journal, BookTitle, Technical Report)
Contents, Extracted Field          V       Volume, Issue
Contents, Extracted Field          P       Page
Contents, Extracted Field          Y       Date (Year, Month)
Contents, Unknown Field            X       Single Unknown
Contents, Unknown Field            B       Continuous Unknown
Contents, Unknown Field            N       Numeral
Punctuation Mark, Partition Mark   R       ,
Punctuation Mark, Partition Mark   D       .
Punctuation Mark, Partition Mark   G       "
Punctuation Mark, Partition Mark   E       '
Punctuation Mark, Partition Mark   C
Punctuation Mark, Partition Mark   Z
Punctuation Mark, Partition Mark   H       -
Punctuation Mark, Brackets         I       ( <
Punctuation Mark, Brackets         K       ) >
Punctuation Mark, Misc             Q       / _ ! @ \ ?
Others                             F       Editor
Others                             S       Institution
Others                             M       Publisher
Others                             W
22
Parsing (Alignment Extraction)
(Query) Index Form:     ARGBRGLRVRPRYD
(Template) Style Form:  ARGTRGLRVRPRYD
ARGTRGLRVRPRYD
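Once the query's Index Form is aligned with a template's Style Form, the field labels can be read off position by position. The sketch below assumes the simplest case, a gap-free alignment of equal-length forms, and reuses the tokenized example citation; the function name, the FIELD_NAMES mapping, and the hard-coded spans are hypothetical and only serve the illustration.

```python
# Tokens of the example citation (punctuation already split off).
TOKENS = ('M . Bianchini , P . Frasconi , and M . Gori , " Learning in '
          'multilayered networks used as autoassociators , " IEEE Transactions '
          'on Neural Networks , vol . 6 , pp . 512 - 515 , March 1995 .').split()

# Token range (start, end) covered by each symbol of the query Index Form
# ARGBRGLRVRPRYD, as recorded during blocking.
SPANS = [(0, 11), (12, 12), (13, 13), (14, 20), (21, 21), (22, 22), (23, 27),
         (28, 28), (29, 31), (32, 32), (33, 37), (38, 38), (39, 40), (41, 41)]

FIELD_NAMES = {"A": "author", "T": "title", "L": "venue", "V": "volume",
               "W": "issue", "P": "page", "Y": "date"}


def extract_fields(style_form, spans, tokens):
    """Label each blocked token range with the field named by the template.

    Assumes a gap-free alignment: symbol i of the style form corresponds
    to block i of the query.
    """
    if len(style_form) != len(spans):
        raise ValueError("this sketch only handles gap-free alignments")
    fields = {}
    for symbol, (start, end) in zip(style_form, spans):
        name = FIELD_NAMES.get(symbol)
        if name is not None:
            fields[name] = " ".join(tokens[start:end + 1])
    return fields


if __name__ == "__main__":
    for name, value in extract_fields("ARGTRGLRVRPRYD", SPANS, TOKENS).items():
        print(f"{name}: {value}")
```

The unknown block B in the query sits opposite T in the template, so its tokens are reported as the title.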
23
System Architecture
24
Experiment
  • Dataset
  • INFOMAP Dataset
  • A total of 160,000 citation records were
    collected from digital libraries on the Web
  • Citation string data was generated for each of
    the six citation styles (APA, IEEE, ACM, MISQ,
    JMIS, and ISR)
  • Cora Dataset
  • 500 records with diverse citation styles
  • FLUX-CiM Dataset
  • 2000 HS-domain records
  • 300 CS-domain records

25
Experiment
  • Evaluation
  • Token-Level
  • A is the number of true positive tokens
  • B is the number of false negative tokens
  • C is the number of false positive tokens
  • D is the number of true negative tokens
  • Field-Level
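The formulas on this slide did not survive in the transcript. Assuming the usual definitions of precision, recall, and F-measure in terms of the counts A, B, and C above, the token-level measures could be computed as in this sketch (the function name is hypothetical; D, the true negatives, is not needed for these three measures).

```python
def token_level_metrics(a: int, b: int, c: int) -> tuple[float, float, float]:
    """Standard precision, recall, and F-measure.

    a: true positive tokens, b: false negative tokens, c: false positive tokens.
    """
    precision = a / (a + c) if (a + c) else 0.0
    recall = a / (a + b) if (a + b) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


if __name__ == "__main__":
    p, r, f = token_level_metrics(a=90, b=10, c=5)
    print(f"precision={p:.4f} recall={r:.4f} f-measure={f:.4f}")
```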

26
INFOMAP Dataset Result
Field     Token-Level  Token-Level  Token-Level  Field-Level
          Precision    Recall       F-Measure    Accuracy
Author    99.38        99.37        99.38        98.14
Title     99.58        97.23        98.39        95.02
Venue     98.00        97.85        97.92        96.36
Volume    96.44        97.83        98.90        97.36
Issue     98.89        95.58        97.21        94.04
Page      99.58        98.93        99.25        98.16
Date      99.29        99.59        99.44        99.16
Average   98.74        98.06        98.64        96.89
27
Cora Dataset Result
Field     Token-Level  Token-Level  Token-Level  Field-Level
          Precision    Recall       F-Measure    Accuracy
Author    95.87        97.75        96.79        90.99
Title     97.45        95.15        96.29        90.46
Venue     91.77        91.21        91.47        79.77
Volume    87.57        87.94        87.72        85.64
Issue     79.96        93.85        86.22        77.51
Page      97.15        97.01        97.07        95.78
Date      96.91        97.16        97.03        94.04
Average   92.38        94.30        93.22        87.74
28
FLUX-CiM HS Domain Dataset
Field     Token-Level  Token-Level  Token-Level  Field-Level
          Precision    Recall       F-Measure    Accuracy
Author    93.45        99.74        96.49        93.30
Title     97.45        95.11        96.26        92.09
Venue     96.21        99.29        97.71        98.65
Volume    99.70        98.64        99.17        98.85
Issue     n/a          n/a          n/a          n/a
Page      100.00       97.07        98.51        96.99
Date      98.38        100.00       99.18        99.00
Average   97.53        98.31        97.89        96.48
29
FLUX-CiM CS Domain Dataset
Field     Token-Level  Token-Level  Token-Level  Field-Level
          Precision    Recall       F-Measure    Accuracy
Author    97.89        98.26        98.05        96.64
Title     99.30        98.12        98.70        96.32
Venue     98.40        98.80        98.60        90.89
Volume    83.71        85.77        84.71        82.13
Issue     89.01        88.56        86.90        87.75
Page      99.47        96.91        98.17        96.17
Date      91.74        98.71        95.07        89.59
Average   94.22        95.02        94.31        91.35
30
Conclusion
  • Parsing citations is a challenging problem
  • Diversity in citation formats
  • We present a template-based parser
  • Parser System
  • http://csclws.iis.sinica.edu.tw:8080/input.jsp
  • Template Generator System
  • http://csclws.iis.sinica.edu.tw:8080/tpin.jsp

31
Reference
  • F. Peng, A. McCallum. Accurate information
    extraction from research papers using conditional
    random fields. Proceedings of Human Language
    Technology Conference and North American Chapter
    of the Association for Computational Linguistics
    (HLT-NAACL), 2004, 329-336.
  • Hui Han, Giles, C.L., Manavoglu, E., Hongyuan
    Zha, Zhenyue Zhang, Fox, E.A. Automatic document
    metadata extraction using support vector
    machines. Proceedings of the 3rd ACM/IEEE-CS
    Joint Conference on Digital libraries, 2003,
    37-48.
  • K. Seymore, A. McCallum, R. Rosenfeld. Learning
    hidden Markov model structure for information
    extraction. AAAI-99 Workshop on Machine Learning
    for Information Extraction, 1999, 37-42.
  • Takasu, A. Bibliographic attribute extraction
    from erroneous references based on a statistical
    model. Proceedings of the 3rd ACM/IEEE-CS Joint
    Conference on Digital libraries, 2003, 49-60.
  • Min-Yuh Day et al. Reference metadata extraction
    using a hierarchical knowledge representation
    framework. Decision Support Systems, 2006.
  • E. Cortez, A. S. da Silva, M. A. Goncalves, F.
    Mesquita, and E. S. de Moura. FLUX-CiM: flexible
    unsupervised extraction of citation metadata. In
    Proc. of the 7th ACM/IEEE Joint Conf. on Digital
    Libraries, pages 215-224, Vancouver, BC, Canada,
    2007. ACM.
  • I-Ane Huang, Jan-Ming Ho, Hung-Yu Kao, and
    Shian-Hua Lin. Extracting citation metadata from
    online publication lists using BLAST. In PAKDD,
    2004, 539-548.
  • Chien-Chih Chen, Kai-Hsiang Yang, Hung-Yu Kao,
    Jan-Ming Ho. BibPro: A Citation Parser Based on
    Sequence Alignment Techniques. AINA Workshops
    2008, pp. 1175-1180.
  • Erik Hetzner. A simple method for citation
    metadata extraction using hidden Markov models.
    JCDL 2008.
  • S. F. Altschul, W. Gish, W. Miller, E. Myers and
    D. Lipman. A basic local alignment search tool.
    J. Mol. Biol., 215, 1990, 403-410.
  • Needleman, S. B. and Wunsch, C. D. A general
    method applicable to the search for similarities
    in the amino acid sequence of two proteins. J.
    Mol. Biol., 48, 1970, 443-453.

32
Thank You !