Citation Extractor presentation

1
Citation Extractor

Nguyen Bach
Sue Ann Hong
Ben Lambert

2
Extraction Task
AuthorOf(Author, Paper)
PublishedAt(Paper, Conference)
IsPaper, IsAuthor, IsConference

Citation: <Paper, Authors, Conference>
Pattern: regular expression
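A minimal sketch of the extraction target, assuming a Python representation. The class and function names are illustrative and do not come from the slides; the relation names are the ones listed above.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Citation:
    """One extracted <Paper, Authors, Conference> tuple."""
    paper: str
    authors: Tuple[str, ...]
    conference: str


def to_relations(citations: List[Citation]) -> Set[Tuple[str, ...]]:
    """Expand citations into the unary and binary relations of the task."""
    facts = set()
    for c in citations:
        facts.add(("IsPaper", c.paper))
        facts.add(("IsConference", c.conference))
        facts.add(("PublishedAt", c.paper, c.conference))
        for author in c.authors:
            facts.add(("IsAuthor", author))
            facts.add(("AuthorOf", author, c.paper))
    return facts
```

A citation with two authors yields seven facts: one IsPaper, one IsConference, one PublishedAt, and an IsAuthor/AuthorOf pair per author.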

3
Method Outline
Web pages (HTML, text)
Query Search (WIT)
Extract Patterns using known citations
Extract Citations using new patterns
Citations
Page-specific Patterns
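The two extraction steps in this outline can be sketched as follows. This is only an illustration under stated assumptions: the function names, the fixed author/title/conference field order, and the one-character context window are hypothetical choices, not the presenters' implementation.

```python
import re


def induce_pattern(page, title, author, conference, context=1):
    """Build a page-specific regex from one known citation.

    Assumes the fields occur in the order author, title, conference.
    Returns None if the seed cannot be found in that order.
    """
    a = page.find(author)
    t = page.find(title, a + len(author)) if a >= 0 else -1
    c = page.find(conference, t + len(title)) if t >= 0 else -1
    if min(a, t, c) < 0:
        return None
    prefix = page[max(0, a - context):a]        # start token
    sep1 = page[a + len(author):t]              # text between author and title
    sep2 = page[t + len(title):c]               # text between title and conference
    end = c + len(conference)
    suffix = page[end:end + context]            # end token
    return re.compile(
        re.escape(prefix) + r"(?P<author>.+?)" + re.escape(sep1)
        + r"(?P<title>.+?)" + re.escape(sep2)
        + r"(?P<conference>.+?)" + re.escape(suffix)
    )


def extract_citations(page, pattern):
    """Apply an induced pattern to the same page to find new citations."""
    return [m.groupdict() for m in pattern.finditer(page)]
```

Because the separators are copied literally from the page, the resulting pattern is only as general as the page's own formatting, which is why it is page-specific.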
4
Query: "multiple-goal recognition from low-level signals " " Xiaoyong Chai" " Qiang Yang" "AAAI 2005 "
Page: http://www.informatik.uni-trier.de/ley/db/indices/a-tree/y/YangQiang.html
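A query of this shape can be assembled from a seed citation by quoting each field as an exact phrase. The helper below is a sketch; its name and signature are assumptions, and it trims stray whitespace inside the quotes.

```python
def build_query(title, authors, conference):
    """Quote every field so the search engine treats it as an exact phrase."""
    fields = [title, *authors, conference]
    return " ".join('"{}"'.format(field.strip()) for field in fields)
```

For example, build_query("multiple-goal recognition from low-level signals", ["Xiaoyong Chai", "Qiang Yang"], "AAAI 2005") produces the four quoted phrases of the query on this slide, separated by single spaces.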
5
Finding New Citations
6
The Challenge: Patterns

Beginning and the end
Start token? End token? HTML tags?
-> difficult to find length of token vs. general
NER?
(These things should be talked about while viewing
the previous slide)
Are regexes sufficient? (but not really relevant
for self-supervised learning)
Incorporating NER as a source of possible ENTITY
markers?
Use like AUTHOR, TITLE, CONF but with
probabilities/confidence values

7
System Spits Out

6 seeds -> 60 citations
36 of these (partial citations)
"Theory and Algorithms for Plan Merging " , " Ming Li"
"The Expected Value of Hierarchical Problem-Solving " , " Fahiem Bacchus"
"Handling feature interactions in process-planning "
14 of these (partial strings)
"On D "
"On t " , " John Tromp", " Elizabeth Sweedyk", " Umest Vazirani"
"An L " , " Ronan Sleep"
"To D "
No new conferences (end-token)

8
Bootstrapping, Short-Lived

Highly restrictive regexes
No recovery
The more seeds and variety, the better
Stupid Little Things
Mis-capitalization
Variations in titles (- vs. )
Etc, etc, etc
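Why the bootstrapping dies out quickly can be pictured with a minimal loop. This is a sketch, not the presenters' code: `induce` and `extract` are hypothetical callbacks standing in for pattern induction and pattern application.

```python
def bootstrap(seeds, induce, extract, max_rounds=10):
    """Alternate pattern induction and extraction until nothing new appears.

    Returns the set of known citations and the number of rounds run.
    """
    known = set(seeds)
    for round_no in range(1, max_rounds + 1):
        patterns = induce(known)
        new = set(extract(patterns)) - known
        if not new:
            # Restrictive patterns matched nothing new: the loop has no
            # way to recover, so it stops here.
            return known, round_no
        known |= new
    return known, max_rounds
```

With highly restrictive patterns the `new` set empties after very few rounds, which is the short-lived behaviour named in the slide title.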

9
Why is this one hard?
10
Extensions & Improvements

Less strict string matching
Not case and punctuation sensitive
Better boundary detection
Start/end tokens, HTML wrapper detection?
Better pattern construction
e.g. n authors not 2
NER
Help find the right "window"
A source of ENTITY marker
Use like AUTHOR, TITLE, CONF but with
probabilities/confidence values
Evaluation with DBLP?
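Two of these extensions are easy to sketch: matching that ignores case and punctuation, and an author pattern that accepts n names rather than exactly two. The names below are illustrative, and the name regex is a deliberately simple assumption (capitalized tokens, comma-separated).

```python
import re
import string

# Map every ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def normalize(text):
    """Lower-case, replace punctuation with spaces, collapse whitespace."""
    return " ".join(text.lower().translate(_PUNCT_TO_SPACE).split())


def loosely_equal(a, b):
    """Compare two strings ignoring case, punctuation and spacing."""
    return normalize(a) == normalize(b)


# A name is two or more capitalized tokens; a list is one or more names.
NAME = r"[A-Z][\w.\-']*(?: [A-Z][\w.\-']*)+"
AUTHOR_LIST = re.compile(r"{0}(?:, {0})*".format(NAME))
```

Under this sketch, "Hierarchical Problem-Solving " and "hierarchical problem solving" compare equal, and AUTHOR_LIST matches both a single author and a list of three.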

11
NER

Baseline model (News corpus)
<ENAMEX TYPE="PERSON"> M. Woszczyna, N.
Aoki-Waibel, F. D. Buo, N. Coccaro, K. Horiguchi,
T. Kemp, A. Lavie, A. McNair, T. Polzin, I.
Rogina, C. P. Rose, T. Schultz, B. Suhm, M.
Tomita, A. Waibel, 1994, JANUS 93 </ENAMEX>
Towards Spontaneous Speech Translation,
Proceedings of the International Conference on
Acoustics, Speech, and Signal Processing.
<ENAMEX TYPE="PERSON"> S. Awodey. </ENAMEX>
Topological Representation of the Lambda
Calculus. September <ENAMEX TYPE="PERSON"> 1998.
Math. Struct. </ENAMEX> in <ENAMEX TYPE="LOCATION">
Comp. Sci. (2000), vol. 10, pp. 81--96.
</ENAMEX>
Adapted model (News + citation corpus)
<ENAMEX TYPE="PERSON"> M. Woszczyna, N.
Aoki-Waibel, F. D. Buo, N. Coccaro, K. Horiguchi,
T. Kemp, A. Lavie, A. McNair, T. Polzin, I.
Rogina, C. P. Rose, T. Schultz, B. Suhm, M.
Tomita, A. Waibel, 1994, JANUS 93 </ENAMEX>
Towards Spontaneous Speech Translation,
Proceedings of the <ENAMEX TYPE="ORGANIZATION">
International Conference on Acoustics, Speech,
</ENAMEX> and Signal Processing.
<ENAMEX TYPE="PERSON"> L. Birkedal. </ENAMEX> A
General Notion of Realizability. December 1999.
Proceedings of <ENAMEX TYPE="ORGANIZATION"> LICS
2000 </ENAMEX>

12
NER

HMM-based model (Bikel et al. 99)
Baseline NER: 94 F-score
Trained on 1.1 million words from the news and
broadcast news domains
Apply the baseline model to recognize
Author, Conference, Location

13
NER Example with Baseline Model

<ENAMEX TYPE="PERSON"> M. Woszczyna, N.
Aoki-Waibel, F. D. Buo, N. Coccaro, K. Horiguchi,
T. Kemp, A. Lavie, A. McNair, T. Polzin, I.
Rogina, C. P. Rose, T. Schultz, B. Suhm, M.
Tomita, A. Waibel, 1994, JANUS 93 </ENAMEX>
Towards Spontaneous Speech Translation,
Proceedings of the International Conference on
Acoustics, Speech, and Signal Processing.
<ENAMEX TYPE="PERSON"> L. Birkedal. </ENAMEX> A
General Notion of Realizability. December 1999.
Proceedings of LICS 2000
<ENAMEX TYPE="PERSON"> S. Awodey. </ENAMEX>
Topological Representation of the Lambda
Calculus. September <ENAMEX TYPE="PERSON"> 1998.
Math. Struct. </ENAMEX> in <ENAMEX TYPE="LOCATION">
Comp. Sci. (2000), vol. 10, pp. 81--96.
</ENAMEX>
Good at detecting author-name boundaries, but
sometimes too aggressive.
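Tagged output like the examples on this slide can be turned into (type, text) spans with a small parser. This sketch assumes MUC-style markup of the form <ENAMEX TYPE="..."> ... </ENAMEX>; the function name is hypothetical.

```python
import re

# Non-greedy body so that each tag pair is matched separately.
ENAMEX = re.compile(
    r'<ENAMEX TYPE="(?P<type>[A-Z]+)">\s*(?P<text>.*?)\s*</ENAMEX>',
    re.DOTALL,
)


def parse_enamex(tagged):
    """Return the tagged entities as (type, text) pairs, in document order."""
    return [
        (m.group("type"), " ".join(m.group("text").split()))
        for m in ENAMEX.finditer(tagged)
    ]
```

Running it over the S. Awodey example makes the over-aggressive tagging easy to spot: a PERSON span whose text begins with a year is a likely boundary error.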

14
NER Adaptation

Goal: adapt the baseline model to work better in
the citation domain.
Issue: no training data.
A solution: take 300 citations, run the baseline
model, then hand-correct its output.
Train: replicate the 300 citations 10 times, then
train the adapted model on them together with the
broadcast news corpus.
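The training-set construction described here amounts to oversampling the small in-domain set before mixing it with the large out-of-domain corpus. A minimal sketch, assuming both corpora are lists of tagged sentences; the function name and the `copies` parameter are illustrative.

```python
def build_adaptation_corpus(news_sentences, corrected_citations, copies=10):
    """Combine the news corpus with `copies` repetitions of the citations.

    Repeating the corrected citations gives them more weight in the
    counts that the model is estimated from.
    """
    return list(news_sentences) + list(corrected_citations) * copies


def citation_share(news_sentences, corrected_citations, copies=10):
    """Fraction of training items that come from the citation domain."""
    total = len(news_sentences) + copies * len(corrected_citations)
    return copies * len(corrected_citations) / total if total else 0.0
```

Without the repetition, 300 citations would be a negligible fraction of a corpus of 1.1 million words, so the multiplier controls how strongly the citation domain influences the adapted model.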

15
NER Example with Adaptation Model

<ENAMEX TYPE="PERSON"> M. Woszczyna, N.
Aoki-Waibel, F. D. Buo, N. Coccaro, K. Horiguchi,
T. Kemp, A. Lavie, A. McNair, T. Polzin, I.
Rogina, C. P. Rose, T. Schultz, B. Suhm, M.
Tomita, A. Waibel, 1994, JANUS 93 </ENAMEX>
Towards Spontaneous Speech Translation,
Proceedings of the <ENAMEX TYPE="ORGANIZATION">
International Conference on Acoustics, Speech,
</ENAMEX> and Signal Processing.
<ENAMEX TYPE="PERSON"> L. Birkedal. </ENAMEX> A
General Notion of Realizability. December 1999.
Proceedings of <ENAMEX TYPE="ORGANIZATION"> LICS
2000 </ENAMEX>
<ENAMEX TYPE="PERSON"> D. Litman, D. Bhembe, C.
P. Rose, K. Forbes-Riley, S. Silliman, K.
VanLehn (2004). </ENAMEX> Spoken Versus Typed
Human and Computer Dialogue Tutoring, Proceedings
of the Intelligent Tutoring Systems Conference.

16
How can NER help?

Provide the system with generic patterns.
AUTHOR: M. Woszczyna, N. Aoki-Waibel, F. D. Buo,
N. Coccaro, K. Horiguchi, T. Kemp, A. Lavie, A.
McNair, T. Polzin, I. Rogina, C. P. Rose, T.
Schultz, B. Suhm, M. Tomita, A. Waibel, 1994,
JANUS 93
CONFERENCE: International Conference on
Acoustics, Speech
Then use specific rules to refine
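One way to picture a generic pattern is to replace each tagged span with a slot name and keep the surrounding text as literal context. The sketch below assumes MUC-style ENAMEX markup and a PERSON to AUTHOR, ORGANIZATION to CONFERENCE mapping; both the mapping and the function name are assumptions for illustration.

```python
import re

ENAMEX = re.compile(r'<ENAMEX TYPE="([A-Z]+)">.*?</ENAMEX>', re.DOTALL)

# Hypothetical mapping from NER types to citation slots.
SLOT = {"PERSON": "AUTHOR", "ORGANIZATION": "CONFERENCE"}


def generic_pattern(tagged):
    """Replace mapped entity spans with slot names; leave other tags as-is."""
    def to_slot(match):
        return SLOT.get(match.group(1), match.group(0))
    return " ".join(ENAMEX.sub(to_slot, tagged).split())
```

The specific refinement rules mentioned on the slide would then operate on the slot contents, for example trimming a trailing year out of an AUTHOR span.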

17
Lessons Learned: Another Boring Text Slide

Semi-structured text is surprisingly difficult to
read
Off-line training for wrappers and/or NER may
help
Need very high-confidence rules to ensure
precision
A continuously-running system needs robustness
(internet/Google failure, unexpected errors, ...)
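For the robustness point, a common approach is to wrap each network call in a retry with exponential backoff. This is a generic sketch, not something the slides describe; `fetch` is a hypothetical zero-argument callable, and `sleep` is injectable so the helper can be exercised without waiting.

```python
import time


def with_retries(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(), retrying on any exception with doubling delays.

    Re-raises the last exception if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as exc:  # broad on purpose: unexpected errors too
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

A long-running extractor would log the failure and move on to the next query rather than stop when the final attempt also fails.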
