Title: Citation Extractor
1 Citation Extractor
- Nguyen Bach
- Sue Ann Hong
- Ben Lambert
2 Extraction Task
- AuthorOf(Author, Paper), PublishedAt(Paper, Conference)
- IsPaper, IsAuthor, IsConference
- Citation: <Paper, Authors, Conference>
- Pattern: regular expression
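A minimal sketch of what the citation triple and a regular-expression pattern could look like in code. The `Citation` class, the `PATTERN` layout ("Authors: Title. Conference") and the helper names are assumptions made for illustration; the slides do not specify a concrete pattern.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    """The <Paper, Authors, Conference> triple from the slide."""
    paper: str
    authors: tuple
    conference: str


# Hypothetical pattern for ONE page layout: "Authors: Title. Conference".
# Real pages would each need their own pattern.
PATTERN = re.compile(
    r"(?P<authors>[^:]+): (?P<paper>[^.]+)\. (?P<conference>.+)"
)


def extract(line):
    """Return a Citation if the whole line fits the pattern, else None."""
    m = PATTERN.fullmatch(line.strip())
    if m is None:
        return None
    authors = tuple(a.strip() for a in m.group("authors").split(","))
    return Citation(m.group("paper").strip(), authors,
                    m.group("conference").strip())


def relations(c):
    """Expand a citation into AuthorOf / PublishedAt facts."""
    facts = [("PublishedAt", c.paper, c.conference)]
    facts += [("AuthorOf", a, c.paper) for a in c.authors]
    return facts
```

For example, `extract("Xiaoyong Chai, Qiang Yang: multiple-goal recognition from low-level signals. AAAI 2005")` yields one citation, and `relations` turns it into one PublishedAt fact and two AuthorOf facts.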
3 Method Outline
[Flow diagram] Query → Search (WIT) → Web pages (HTML, text) → Extract patterns using known citations → Page-specific patterns → Extract citations using new patterns → Citations
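The outline can be sketched as a small loop: find a known citation on a page, turn the text around it into a page-specific regular expression, then apply that expression to the same page to pick up new citations. This is a simplified sketch, not the system's actual code. It assumes the fields appear in the order authors, title, conference, and the function names, the `width` parameter and the toy page (whose second entry is invented) are all hypothetical.

```python
import re


def induce_pattern(page, authors, title, conference, width=4):
    """Build a page-specific regex from one known citation found in `page`."""
    a = page.find(authors)
    if a < 0:
        return None
    t = page.find(title, a + len(authors))
    if t < 0:
        return None
    c = page.find(conference, t + len(title))
    if c < 0:
        return None
    end = c + len(conference)
    prefix = page[max(0, a - width):a]      # start token: left context
    between_1 = page[a + len(authors):t]    # text separating authors and title
    between_2 = page[t + len(title):c]      # text separating title and conference
    suffix = page[end:end + width]          # end token: right context
    return re.compile(
        re.escape(prefix) + r"(?P<authors>.+?)" + re.escape(between_1)
        + r"(?P<title>.+?)" + re.escape(between_2)
        + r"(?P<conference>.+?)" + re.escape(suffix)
    )


def bootstrap(pages, seeds):
    """One pass: induce patterns from known citations, harvest new ones."""
    known = set(seeds)
    for page in pages:
        for seed in list(known):
            pattern = induce_pattern(page, *seed)
            if pattern is None:
                continue
            for m in pattern.finditer(page):
                known.add((m.group("authors"), m.group("title"),
                           m.group("conference")))
    return known


# Toy page: the first entry is the seed, the second is made up for the demo.
PAGE = (
    "<li>Xiaoyong Chai, Qiang Yang: <i>multiple-goal recognition from "
    "low-level signals</i> AAAI 2005</li>\n"
    "<li>A. Author, B. Writer: <i>A Second Toy Paper</i> TOYCONF 2004</li>\n"
)
SEED = ("Xiaoyong Chai, Qiang Yang",
        "multiple-goal recognition from low-level signals", "AAAI 2005")
found = bootstrap([PAGE], [SEED])
```

Because the pattern copies the surrounding text literally, it only works on pages whose entries share the seed's exact markup.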
4 Query: "multiple-goal recognition from low-level signals" "Xiaoyong Chai" "Qiang Yang" "AAAI 2005"
Page: http://www.informatik.uni-trier.de/ley/db/indices/a-tree/y/YangQiang.html
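A small sketch of how such a query string could be assembled from a seed citation, with each field quoted so the search engine treats it as an exact phrase. The function name `build_query` is hypothetical, and how the string is submitted to the search service is left out because the slides do not describe it.

```python
def build_query(title, authors, venue):
    """Quote every field of a known citation and join them into one query."""
    parts = [title, *authors, venue]
    return " ".join('"{}"'.format(p.strip()) for p in parts)


query = build_query(
    "multiple-goal recognition from low-level signals",
    ["Xiaoyong Chai", "Qiang Yang"],
    "AAAI 2005",
)
```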
5 Finding New Citations
6 The Challenge: Patterns
- Beginning and the end
  - Start token? End token? HTML tags?
  - → Difficult to find: token length vs. generality
- NER?
- Notes (to be discussed while viewing the previous slide):
  - Are regexes sufficient? (Not really relevant for self-supervised learning, though.)
  - Incorporate NER as a source of possible ENTITY markers?
  - Use them like AUTHOR, TITLE, CONF, but with probabilities/confidence values
7 System Spits Out
- 6 seeds → 60 citations
- 36 of these (partial citations)
  - "Theory and Algorithms for Plan Merging", "Ming Li"
  - "The Expected Value of Hierarchical Problem-Solving", "Fahiem Bacchus"
  - "Handling feature interactions in process-planning"
- 14 of these (partial strings)
  - "On D"
  - "On t", "John Tromp", "Elizabeth Sweedyk", "Umest Vazirani"
  - "An L", "Ronan Sleep"
  - "To D"
- No new conferences (end-token)
8 Bootstrapping, Short-Lived
- Highly restrictive regexes
  - No recovery
  - The more seeds and variety, the better
- Stupid little things
  - Mis-capitalization
  - Variations in titles (- vs. )
  - Etc., etc., etc.
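One way to keep capitalization and punctuation variants from stopping the bootstrap is to compare normalized strings instead of raw ones. The sketch below is an assumption about how that could be done, not part of the presented system; `normalize` and `same_title` are hypothetical names.

```python
import re
import unicodedata


def normalize(text):
    """Case-fold, replace punctuation (any dash included) with spaces,
    and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text).casefold()
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def same_title(a, b):
    """True if two titles differ only in case, punctuation or spacing."""
    return normalize(a) == normalize(b)
```

With this, "Problem-Solving" written with a hyphen and with an en dash compare equal, as do differently capitalized copies of the same title.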
9 Why is this one hard?
10 Extensions & Improvements
- Less strict string matching
  - Not case- or punctuation-sensitive
- Better boundary detection
  - Start/end tokens, HTML wrapper detection?
- Better pattern construction
  - e.g. n authors, not 2
- NER
  - Helps find the right "window"
  - A source of ENTITY markers
  - Use them like AUTHOR, TITLE, CONF, but with probabilities/confidence values
- Evaluation with DBLP?
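To illustrate the "n authors, not 2" point, here is a sketch of an author-list expression that accepts any number of comma-separated names instead of a fixed count. The `NAME` expression is a rough assumption (two or more capitalized tokens, with initials and hyphens allowed) and would miss many real name forms.

```python
import re

# Hypothetical name shape: "Ming Li", "C. P. Rose", "K. Forbes-Riley".
NAME = r"[A-Z][\w.'-]*(?: [A-Z][\w.'-]*)+"

# One name followed by zero or more ", name" repetitions.
AUTHOR_LIST = re.compile(r"(?P<authors>{0}(?:, {0})*)".format(NAME))


def split_authors(text):
    """Return the leading author list of `text` as a list of names."""
    m = AUTHOR_LIST.match(text)
    if m is None:
        return []
    return [a.strip() for a in m.group("authors").split(",")]
```

The same expression handles a single author and a six-author list, so the pattern no longer depends on how many authors the seed citation happened to have.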
11 NER
- Baseline model (News corpus)
  - <ENAMEX TYPE="PERSON"> M. Woszczyna, N. Aoki-Waibel, F. D. Buo, N. Coccaro, K. Horiguchi, T. Kemp, A. Lavie, A. McNair, T. Polzin, I. Rogina, C. P. Rose, T. Schultz, B. Suhm, M. Tomita, A. Waibel, 1994, JANUS 93 </ENAMEX> Towards Spontaneous Speech Translation, Proceedings of the International Conference on Acoustics, Speech, and Signal Processing.
  - <ENAMEX TYPE="PERSON"> S. Awodey. </ENAMEX> Topological Representation of the Lambda Calculus. September <ENAMEX TYPE="PERSON"> 1998. Math. Struct. </ENAMEX> in <ENAMEX TYPE="LOCATION"> Comp. Sci. (2000), vol. 10, pp. 81--96. </ENAMEX>
- Adapted model (News + citation corpus)
  - <ENAMEX TYPE="PERSON"> M. Woszczyna, N. Aoki-Waibel, F. D. Buo, N. Coccaro, K. Horiguchi, T. Kemp, A. Lavie, A. McNair, T. Polzin, I. Rogina, C. P. Rose, T. Schultz, B. Suhm, M. Tomita, A. Waibel, 1994, JANUS 93 </ENAMEX> Towards Spontaneous Speech Translation, Proceedings of the <ENAMEX TYPE="ORGANIZATION"> International Conference on Acoustics, Speech, </ENAMEX> and Signal Processing.
  - <ENAMEX TYPE="PERSON"> L. Birkedal. </ENAMEX> A General Notion of Realizability. December 1999. Proceedings of <ENAMEX TYPE="ORGANIZATION"> LICS 2000 </ENAMEX>
12 NER
- HMM-based model (Bikel et al., 1999)
- Baseline NER: 94 F-score
- Trained on 1.1 million words from the news and broadcast news domains
- Apply the baseline model to recognize:
  - Author, Conference, Location
13 NER Example with Baseline Model
- <ENAMEX TYPE="PERSON"> M. Woszczyna, N. Aoki-Waibel, F. D. Buo, N. Coccaro, K. Horiguchi, T. Kemp, A. Lavie, A. McNair, T. Polzin, I. Rogina, C. P. Rose, T. Schultz, B. Suhm, M. Tomita, A. Waibel, 1994, JANUS 93 </ENAMEX> Towards Spontaneous Speech Translation, Proceedings of the International Conference on Acoustics, Speech, and Signal Processing.
- <ENAMEX TYPE="PERSON"> L. Birkedal. </ENAMEX> A General Notion of Realizability. December 1999. Proceedings of LICS 2000
- <ENAMEX TYPE="PERSON"> S. Awodey. </ENAMEX> Topological Representation of the Lambda Calculus. September <ENAMEX TYPE="PERSON"> 1998. Math. Struct. </ENAMEX> in <ENAMEX TYPE="LOCATION"> Comp. Sci. (2000), vol. 10, pp. 81--96. </ENAMEX>
- Good at detecting author-name boundaries, but sometimes too aggressive.
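A sketch of how the tagger output could be read back into (type, text) spans, assuming it uses MUC-style `<ENAMEX TYPE="...">` markup. The function names are hypothetical, and the digit check is just one simple heuristic for spotting the over-aggressive PERSON spans mentioned above.

```python
import re

ENAMEX = re.compile(
    r'<ENAMEX TYPE="(?P<type>[A-Z]+)">(?P<text>.*?)</ENAMEX>', re.DOTALL
)


def entities(tagged):
    """Return (type, text) pairs for every ENAMEX span, whitespace-trimmed."""
    return [(m.group("type"), " ".join(m.group("text").split()))
            for m in ENAMEX.finditer(tagged)]


def suspicious_persons(tagged):
    """PERSON spans containing digits are likely tagging errors."""
    return [text for kind, text in entities(tagged)
            if kind == "PERSON" and any(ch.isdigit() for ch in text)]


example = (
    '<ENAMEX TYPE="PERSON"> S. Awodey. </ENAMEX> Topological Representation '
    'of the Lambda Calculus. September <ENAMEX TYPE="PERSON"> 1998. Math. '
    'Struct. </ENAMEX> in <ENAMEX TYPE="LOCATION"> Comp. Sci. (2000), '
    'vol. 10, pp. 81--96. </ENAMEX>'
)
```

On the Awodey example, `suspicious_persons(example)` flags the span "1998. Math. Struct.", which the baseline model mislabels as a person.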
14 Adaptation NER
- Goal: adapt the baseline model to work better in the citation domain.
- Issue: no training data.
- A solution: take 300 citations, run the baseline model on them, then correct its output.
- Train: replicate the 300 citations 10 times, then train the adapted model on them together with the broadcast news corpus.
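The replication step amounts to oversampling the small in-domain set so it is not drowned out by the much larger news corpus. A minimal sketch, assuming both corpora are lists of annotated sentences; the function name and the `factor` parameter are hypothetical, and the actual model training is not shown.

```python
def build_adaptation_corpus(news_sentences, citation_sentences, factor=10):
    """Concatenate the news corpus with `factor` copies of the citations."""
    return list(news_sentences) + list(citation_sentences) * factor


def citation_share(news_sentences, citation_sentences, factor=10):
    """Fraction of the combined corpus that comes from citations."""
    total = len(news_sentences) + factor * len(citation_sentences)
    return factor * len(citation_sentences) / total if total else 0.0
```

Raising `factor` increases the weight of citation-style text at the cost of repeating the same 300 examples.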
15 NER Example with Adaptation Model
- <ENAMEX TYPE="PERSON"> M. Woszczyna, N. Aoki-Waibel, F. D. Buo, N. Coccaro, K. Horiguchi, T. Kemp, A. Lavie, A. McNair, T. Polzin, I. Rogina, C. P. Rose, T. Schultz, B. Suhm, M. Tomita, A. Waibel, 1994, JANUS 93 </ENAMEX> Towards Spontaneous Speech Translation, Proceedings of the <ENAMEX TYPE="ORGANIZATION"> International Conference on Acoustics, Speech, </ENAMEX> and Signal Processing.
- <ENAMEX TYPE="PERSON"> L. Birkedal. </ENAMEX> A General Notion of Realizability. December 1999. Proceedings of <ENAMEX TYPE="ORGANIZATION"> LICS 2000 </ENAMEX>
- <ENAMEX TYPE="PERSON"> D. Litman, D. Bhembe, C. P. Rose, K. Forbes-Riley, S. Silliman, K. VanLehn (2004). </ENAMEX> Spoken Versus Typed Human and Computer Dialogue Tutoring, Proceedings of the Intelligent Tutoring Systems Conference.
16 How Can NER Help?
- Provide the system with generic patterns.
  - AUTHOR: M. Woszczyna, N. Aoki-Waibel, F. D. Buo, N. Coccaro, K. Horiguchi, T. Kemp, A. Lavie, A. McNair, T. Polzin, I. Rogina, C. P. Rose, T. Schultz, B. Suhm, M. Tomita, A. Waibel, 1994, JANUS 93
  - CONFERENCE: International Conference on Acoustics, Speech
- Then use specific rules to refine them.
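As an example of a "specific rule", the AUTHOR span above runs past the names into the year and part of the title. One possible refinement, sketched here as an assumption rather than the authors' rule set, cuts the span at the first four-digit year and splits what is left on commas. `refine_author_span` is a hypothetical name.

```python
import re

# A four-digit year starting with 19 or 20, optionally in parentheses.
YEAR = re.compile(r"\(?\b(?:19|20)\d{2}\b\)?")


def refine_author_span(span):
    """Split a coarse AUTHOR span into (list of names, year or None)."""
    m = YEAR.search(span)
    head = span[:m.start()] if m else span
    year = m.group().strip("()") if m else None
    authors = [a.strip() for a in head.split(",") if a.strip()]
    return authors, year


span = ("M. Woszczyna, N. Aoki-Waibel, F. D. Buo, N. Coccaro, K. Horiguchi, "
        "T. Kemp, A. Lavie, A. McNair, T. Polzin, I. Rogina, C. P. Rose, "
        "T. Schultz, B. Suhm, M. Tomita, A. Waibel, 1994, JANUS 93")
authors, year = refine_author_span(span)
```

Here the trailing "1994, JANUS 93" is dropped from the author list and the year is kept as a separate field.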
17 Lessons Learned: Another Boring Text Slide
- Semi-structured text is surprisingly difficult to read
- Off-line training for wrappers and/or NER may help
- Very high-confidence rules are needed to ensure precision
- A continuously running system needs robustness (internet/Google failure, unexpected errors, ...)
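For the robustness point, a common approach is to wrap every network call in a retry loop with increasing delays, so a single failed search does not stop a long-running crawl. This is a generic sketch, not the presented system's code; `with_retries` and its parameters are hypothetical, and `fetch` stands for whatever function performs the search or download.

```python
import time


def with_retries(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `fetch()`; on any exception wait and retry, doubling the delay.

    Returns the result, or None once all attempts have failed, so the
    caller can skip this item and carry on with the next one.
    """
    for i in range(attempts):
        try:
            return fetch()
        except Exception:
            if i == attempts - 1:
                return None
            sleep(base_delay * 2 ** i)
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting.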