Citation Extractor - PowerPoint PPT Presentation

About This Presentation
Title:

Citation Extractor

Description:

Citation Extractor. Nguyen Bach. Sue Ann Hong. Ben Lambert. AuthorOf(Author, Paper) ... AUTHOR = M. Woszczyna, N. Aoki-Waibel, F. D. Buo, N. Coccaro, K. Horiguchi, T. ... – PowerPoint PPT presentation

Provided by: scie5
Learn more at: http://www.cs.cmu.edu

Transcript and Presenter's Notes

Title: Citation Extractor


1
Citation Extractor
  • Nguyen Bach
  • Sue Ann Hong
  • Ben Lambert

2
Extraction Task
AuthorOf(Author, Paper), PublishedAt(Paper, Conference)
IsPaper, IsAuthor, IsConference
  • Citation: <Paper, Authors, Conference>
  • Pattern: regular expression
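
A minimal sketch of how one citation tuple could be expanded into the relations listed on this slide. The `Citation` type and `to_facts` helper are hypothetical names used for illustration; the seed values are the ones from the query slide.

```python
from typing import NamedTuple, Set, Tuple


class Citation(NamedTuple):
    paper: str
    authors: Tuple[str, ...]
    conference: str


def to_facts(c: Citation) -> Set[tuple]:
    """Expand one citation tuple into the unary and binary facts it implies."""
    facts = {
        ("IsPaper", c.paper),
        ("IsConference", c.conference),
        ("PublishedAt", c.paper, c.conference),
    }
    for author in c.authors:
        facts.add(("IsAuthor", author))
        facts.add(("AuthorOf", author, c.paper))
    return facts


seed = Citation(
    paper="multiple-goal recognition from low-level signals",
    authors=("Xiaoyong Chai", "Qiang Yang"),
    conference="AAAI 2005",
)
facts = to_facts(seed)
```

Storing facts as plain tuples keeps the sketch small; a real system would likely attach a source page and a confidence value to each fact.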

3
Method Outline
Web pages (HTML, text)
Query Search (WIT)
Extract Patterns using known citations
Extract Citations using new patterns
Citations
Page-specific Patterns
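
The loop in this outline can be sketched roughly as follows. This is an illustration under strong assumptions, not the authors' implementation: the web search step is replaced by a made-up in-memory `PAGES` dictionary, citations are reduced to (author, title) pairs, and a page-specific pattern is induced from a single line that contains a known citation.

```python
import re

# Toy stand-in for search results: page id -> page text (made-up data).
PAGES = {
    "p1": "<li>Ann Lee: <i>Paper One</i></li>\n"
          "<li>Bo Kim: <i>Paper Two</i></li>\n",
    "p2": "Paper Two / Bo Kim\n"
          "Paper Three / Cy Roe\n",
}


def search(author, title):
    """Hypothetical search step: return pages that mention both strings."""
    return [text for text in PAGES.values() if author in text and title in text]


def induce_pattern(line, author, title):
    """Turn one line holding a known citation into a page-specific regex."""
    rx = re.escape(line)
    rx = rx.replace(re.escape(author), "(?P<author>.+?)", 1)
    rx = rx.replace(re.escape(title), "(?P<title>.+?)", 1)
    return re.compile(rx)


def bootstrap(seeds, max_rounds=5):
    known = set(seeds)
    frontier = set(seeds)
    for _ in range(max_rounds):
        found = set()
        for author, title in frontier:
            for page in search(author, title):
                lines = page.splitlines()
                for line in lines:
                    if author in line and title in line:
                        pattern = induce_pattern(line, author, title)
                        # Apply the new pattern to every line of the same page.
                        for other in lines:
                            m = pattern.fullmatch(other)
                            if m:
                                found.add((m.group("author"), m.group("title")))
        frontier = found - known
        if not frontier:
            break  # no new citations: bootstrapping has stalled
        known |= frontier
    return known


citations = bootstrap({("Ann Lee", "Paper One")})
```

Starting from one seed, the first page yields a second citation, which in turn unlocks the differently formatted second page.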
4
Query: "multiple-goal recognition from low-level signals" "Xiaoyong Chai" "Qiang Yang" "AAAI 2005"
Page: http://www.informatik.uni-trier.de/ley/db/indices/a-tree/y/YangQiang.html
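
A small sketch of how such a query string could be assembled from a known citation, with each field quoted as an exact phrase. `build_query` is a hypothetical helper; the slides do not say how the query was actually constructed.

```python
def build_query(title, authors, conference):
    """Quote every field of a known citation so it is searched as an exact phrase."""
    parts = [title, *authors, conference]
    return " ".join('"{}"'.format(part.strip()) for part in parts)


query = build_query(
    "multiple-goal recognition from low-level signals",
    ["Xiaoyong Chai", "Qiang Yang"],
    "AAAI 2005",
)
```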
5
Finding New Citations
6
The Challenge: Patterns
  • Beginning and the end
  • Start token? End token? HTML tags?
  • → Difficult to find: length of token vs. generality
  • NER?
  • These things should be talked about while viewing
    the previous slide
  • Are regexes sufficient? (but not really relevant
    for self-supervised learning)
  • Incorporating NER as a source of possible ENTITY
    markers?
  • Use like AUTHOR, TITLE, CONF but with
    probabilities/confidence values
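
The start/end-token trade-off can be illustrated with a toy experiment. The sketch below assumes the author precedes the title on the page and uses a made-up `PAGE`; `context_pattern` and its `width` parameter are hypothetical. It builds a regex from `width` characters of context on each side of a known citation.

```python
import re

# Made-up page used only to illustrate the effect of context width.
PAGE = (
    "<ul>\n"
    "<li>Ann Lee: <i>Paper One</i></li>\n"
    "<li>Bo Kim: <i>Paper Two</i></li>\n"
    "<li>Cy Roe: <i>Paper Three</i></li>\n"
    "</ul>\n"
)


def context_pattern(page, author, title, width):
    """Regex using `width` characters of left and right context as start/end tokens."""
    a = page.find(author)
    t = page.find(title, a + len(author))
    if a < 0 or t < 0:
        return None
    left = page[max(0, a - width):a]
    middle = page[a + len(author):t]
    right = page[t + len(title):t + len(title) + width]
    return re.compile(
        re.escape(left) + "(.+?)" + re.escape(middle) + "(.+?)" + re.escape(right)
    )


def extract(width):
    return context_pattern(PAGE, "Ann Lee", "Paper One", width).findall(PAGE)


for width in (0, 4, 20):
    print(width, extract(width))
```

With no context the lazy title group stops after one character, giving partial strings; with four characters of context all three entries come out cleanly; with twenty the pattern absorbs the next entry's author and matches only the seed.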

7
System Spits Out
  • 6 seeds → 60 citations
  • 36 of these (partial citations)
  • "Theory and Algorithms for Plan Merging", "Ming Li"
  • "The Expected Value of Hierarchical
    Problem-Solving", "Fahiem Bacchus"
  • "Handling feature interactions in
    process-planning"
  • 14 of these (partial strings)
  • "On D"
  • "On t", "John Tromp", "Elizabeth Sweedyk",
    "Umest Vazirani"
  • "An L", "Ronan Sleep"
  • "To D"
  • No new conferences (end-token)

8
Bootstrapping, Short-Lived
  • Highly restrictive regexes
  • No recovery
  • The more seeds and variety, the better
  • Stupid Little Things
  • Mis-capitalization
  • Variations in titles (- vs. )
  • Etc., etc., etc.

9
Why is this one hard?
10
Extensions and Improvements
  • Less strict string matching
  • Not case- and punctuation-sensitive
  • Better boundary detection
  • Start/end tokens, HTML wrapper detection?
  • Better pattern construction
  • e.g., n authors, not 2
  • NER
  • Helps find the right "window"
  • A source of ENTITY markers
  • Use like AUTHOR, TITLE, CONF but with
    probabilities/confidence values
  • Evaluation with DBLP?
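
One way the "less strict string matching" item might look in practice, as a sketch only: `normalize` and `loose_pattern` are hypothetical helpers that ignore case and treat any run of punctuation or whitespace as a single separator.

```python
import re


def normalize(text):
    """Case-fold and collapse every run of punctuation/whitespace to one space."""
    return re.sub(r"[\W_]+", " ", text.casefold()).strip()


def loosely_equal(a, b):
    return normalize(a) == normalize(b)


def loose_pattern(text):
    """Regex that finds `text` on a page regardless of case and separators."""
    tokens = re.findall(r"\w+", text)
    return re.compile(r"[\W_]+".join(map(re.escape, tokens)), re.IGNORECASE)


same = loosely_equal(
    "Theory and Algorithms for Plan Merging ",
    "theory and algorithms for plan-merging",
)
hit = loose_pattern("Problem-Solving").search("Hierarchical problem solving")
```

Matching on normalized tokens handles mis-capitalization and hyphen variants, at the cost of occasionally conflating titles that differ only in punctuation.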

11
NER
  • Baseline model (News corpus)
  • <ENAMEX TYPE="PERSON"> M. Woszczyna, N.
    Aoki-Waibel, F. D. Buo, N. Coccaro, K. Horiguchi,
    T. Kemp, A. Lavie, A. McNair, T. Polzin, I.
    Rogina, C. P. Rose, T. Schultz, B. Suhm, M.
    Tomita, A. Waibel, 1994, JANUS 93 </ENAMEX>
    Towards Spontaneous Speech Translation,
    Proceedings of the International Conference on
    Acoustics, Speech, and Signal Processing.
  • <ENAMEX TYPE="PERSON"> S. Awodey. </ENAMEX>
    Topological Representation of the Lambda
    Calculus. September <ENAMEX TYPE="PERSON"> 1998.
    Math. Struct. </ENAMEX> in <ENAMEX TYPE="LOCATION">
    Comp. Sci. (2000), vol. 10, pp. 81--96.
    </ENAMEX>
  • Adapted model (News citation corpus)
  • <ENAMEX TYPE="PERSON"> M. Woszczyna, N.
    Aoki-Waibel, F. D. Buo, N. Coccaro, K. Horiguchi,
    T. Kemp, A. Lavie, A. McNair, T. Polzin, I.
    Rogina, C. P. Rose, T. Schultz, B. Suhm, M.
    Tomita, A. Waibel, 1994, JANUS 93 </ENAMEX>
    Towards Spontaneous Speech Translation,
    Proceedings of the <ENAMEX TYPE="ORGANIZATION">
    International Conference on Acoustics, Speech,
    </ENAMEX> and Signal Processing.
  • <ENAMEX TYPE="PERSON"> L. Birkedal. </ENAMEX> A
    General Notion of Realizability. December 1999.
    Proceedings of <ENAMEX TYPE="ORGANIZATION"> LICS
    2000 </ENAMEX>

12
NER
  • HMM-based model (Bikel et al. 99)
  • Baseline NER: 94 F-score
  • Trained on 1.1 million words from the News and
    Broadcast News domains
  • Apply the baseline model to recognize
  • Author, Conference, Location

13
NER Example with Baseline Model
  • <ENAMEX TYPE="PERSON"> M. Woszczyna, N.
    Aoki-Waibel, F. D. Buo, N. Coccaro, K. Horiguchi,
    T. Kemp, A. Lavie, A. McNair, T. Polzin, I.
    Rogina, C. P. Rose, T. Schultz, B. Suhm, M.
    Tomita, A. Waibel, 1994, JANUS 93 </ENAMEX>
    Towards Spontaneous Speech Translation,
    Proceedings of the International Conference on
    Acoustics, Speech, and Signal Processing.
  • <ENAMEX TYPE="PERSON"> L. Birkedal. </ENAMEX> A
    General Notion of Realizability. December 1999.
    Proceedings of LICS 2000
  • <ENAMEX TYPE="PERSON"> S. Awodey. </ENAMEX>
    Topological Representation of the Lambda
    Calculus. September <ENAMEX TYPE="PERSON"> 1998.
    Math. Struct. </ENAMEX> in <ENAMEX TYPE="LOCATION">
    Comp. Sci. (2000), vol. 10, pp. 81--96.
    </ENAMEX>
  • Good at detecting author name boundaries, but
    sometimes too aggressive.
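
A sketch of how tagged output like the above could be read back into (type, text) spans, assuming MUC-style `<ENAMEX TYPE="...">` markup. The `too_aggressive` check is an assumed heuristic, not something from the slides: it flags PERSON spans that contain a four-digit year.

```python
import re

ENAMEX = re.compile(
    r'<ENAMEX TYPE="(?P<type>[A-Z]+)">(?P<text>.*?)</ENAMEX>', re.DOTALL
)
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")


def entities(tagged):
    """Return (type, text) pairs, with whitespace inside each span collapsed."""
    return [
        (m.group("type"), " ".join(m.group("text").split()))
        for m in ENAMEX.finditer(tagged)
    ]


def too_aggressive(entity_type, text):
    """Assumed heuristic: a PERSON span holding a year likely ran past the names."""
    return entity_type == "PERSON" and YEAR.search(text) is not None


tagged = (
    '<ENAMEX TYPE="PERSON"> S. Awodey. </ENAMEX> Topological Representation '
    'of the Lambda Calculus. September <ENAMEX TYPE="PERSON"> 1998. Math. '
    'Struct. </ENAMEX> in <ENAMEX TYPE="LOCATION"> Comp. Sci. (2000), vol. 10, '
    'pp. 81--96. </ENAMEX>'
)
spans = entities(tagged)
flags = [too_aggressive(t, s) for t, s in spans]
```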

14
NER Adaptation
  • Goal: adapt the baseline model to work better in
    the citation domain.
  • Issue: no training data.
  • A solution: take 300 citations, run the baseline
    model on them, then correct the output by hand.
  • Train: replicate the 300 citations 10 times, then
    train the adapted model on them together with the
    broadcast news corpus.
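
The data preparation step described above might look roughly like this. It is a sketch only: `build_adaptation_corpus` is a hypothetical helper, the inputs are placeholders, and the size of the news list is arbitrary rather than taken from the slides.

```python
from collections import Counter


def build_adaptation_corpus(corrected_citations, news_sentences, copies=10):
    """Combine the news data with `copies` repetitions of the corrected citations."""
    corpus = [("news", s) for s in news_sentences]
    corpus += [("citation", c) for c in corrected_citations] * copies
    return corpus


# Placeholders standing in for the hand-corrected citations and the news corpus.
citations = ["citation %d" % i for i in range(300)]
news = ["news sentence %d" % i for i in range(1000)]

corpus = build_adaptation_corpus(citations, news)
counts = Counter(domain for domain, _ in corpus)
```

Repeating the small in-domain set is a simple way to keep it from being swamped by the much larger out-of-domain corpus.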

15
NER Example with Adaptation Model
  • <ENAMEX TYPE="PERSON"> M. Woszczyna, N.
    Aoki-Waibel, F. D. Buo, N. Coccaro, K. Horiguchi,
    T. Kemp, A. Lavie, A. McNair, T. Polzin, I.
    Rogina, C. P. Rose, T. Schultz, B. Suhm, M.
    Tomita, A. Waibel, 1994, JANUS 93 </ENAMEX>
    Towards Spontaneous Speech Translation,
    Proceedings of the <ENAMEX TYPE="ORGANIZATION">
    International Conference on Acoustics, Speech,
    </ENAMEX> and Signal Processing.
  • <ENAMEX TYPE="PERSON"> L. Birkedal. </ENAMEX> A
    General Notion of Realizability. December 1999.
    Proceedings of <ENAMEX TYPE="ORGANIZATION"> LICS
    2000 </ENAMEX>
  • <ENAMEX TYPE="PERSON"> D. Litman, D. Bhembe, C.
    P. Rose, K. Forbes-Riley, S. Silliman, K.
    VanLehn (2004). </ENAMEX> Spoken Versus Typed
    Human and Computer Dialogue Tutoring, Proceedings
    of the Intelligent Tutoring Systems Conference.

16
How can NER help?
  • Provides the system with generic patterns.
  • AUTHOR = M. Woszczyna, N. Aoki-Waibel, F. D. Buo,
    N. Coccaro, K. Horiguchi, T. Kemp, A. Lavie, A.
    McNair, T. Polzin, I. Rogina, C. P. Rose, T.
    Schultz, B. Suhm, M. Tomita, A. Waibel, 1994,
    JANUS 93
  • CONFERENCE = International Conference on
    Acoustics, Speech
  • Then use specific rules to refine
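
A sketch of the two steps on this slide, assuming MUC-style ENAMEX markup. `to_template` turns NER output into a generic pattern by replacing tagged spans with slot names, and `trim_author_span` is one example of a specific refinement rule. Both helpers, the `SLOT` mapping, and the year-based rule are assumptions made for illustration.

```python
import re

# Assumed mapping from NER entity types to citation slots.
SLOT = {"PERSON": "AUTHOR", "ORGANIZATION": "CONFERENCE"}
ENAMEX = re.compile(r'<ENAMEX TYPE="([A-Z]+)">(.*?)</ENAMEX>', re.DOTALL)
YEAR_TAIL = re.compile(r"[\s,(]*\b(?:19|20)\d{2}\b")


def to_template(tagged):
    """Replace each tagged span by its slot name; untagged text stays literal."""
    def slot_name(match):
        return SLOT.get(match.group(1), match.group(1))
    return " ".join(ENAMEX.sub(slot_name, tagged).split())


def trim_author_span(span):
    """Assumed refinement rule: cut an AUTHOR span at the first four-digit year."""
    m = YEAR_TAIL.search(span)
    return span[:m.start()].strip() if m else span.strip()


tagged = (
    '<ENAMEX TYPE="PERSON"> L. Birkedal. </ENAMEX> A General Notion of '
    'Realizability. December 1999. Proceedings of '
    '<ENAMEX TYPE="ORGANIZATION"> LICS 2000 </ENAMEX>'
)
template = to_template(tagged)
authors = trim_author_span("M. Tomita, A. Waibel, 1994, JANUS 93")
```

The template keeps the literal text between slots, so it can still serve as a page-independent pattern once the slots are turned into capture groups.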

17
Lessons Learned: Another Boring Text Slide
  • Semi-structured text is surprisingly difficult to
    read
  • Off-line training for wrappers and/or NER may
    help
  • Need very high-confidence rules to ensure
    precision
  • A continuously running system needs robustness
    (internet/Google failure, unexpected errors, ...)