Name Extraction from Chinese Novels

Slides: 8
Provided by: BillMac9
Learn more at: https://nlp.stanford.edu
Transcript and Presenter's Notes

1
Name Extraction from Chinese Novels
  • CS224n
  • Spring 2008
  • Jing Chen and Raylene Yung

2
Problem
  • Given a Chinese novel, extract the names of
    people and locations
  • Different from English NER: no whitespace within
    sentences, no capitalization
  • Can use other characteristics since the domain is
    limited

3
System Outline
  • Extract bigrams, trigrams, and quadrigrams from
    text
  • Run logistic regression on extracted features to
    learn feature weights
  • Use weights to compute a score for each n-gram
  • Apply thresholding to limit the number of guessed
    names
  • Use word lists from word segmenter and dictionary
  • Compare output list to correct list for F1 score
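The first steps of the outline can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the helper names `extract_ngrams` and `score`, the rule that candidates never cross punctuation, and the sample sentence are all hypothetical, and the weights passed to `score` stand in for whatever logistic regression learned.

```python
import math
import re
from collections import Counter

# Assumption: candidate names lie inside unbroken runs of CJK characters,
# so punctuation and other symbols end a run.
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


def extract_ngrams(text, sizes=(2, 3, 4)):
    """Count every character n-gram of the given sizes inside each CJK run."""
    counts = Counter()
    for run in CJK_RUN.findall(text):
        for n in sizes:
            for i in range(len(run) - n + 1):
                counts[run[i:i + n]] += 1
    return counts


def score(features, weights, bias=0.0):
    """Logistic score in (0, 1) from a feature dict and a weight dict."""
    z = bias + sum(weights.get(name, 0.0) * value
                   for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical sample text: two clauses separated by punctuation.
sample = "宝玉笑道，宝玉来了。"
counts = extract_ngrams(sample)
```

Here `counts["宝玉"]` is 2 because the bigram appears once in each clause, and no n-gram spans the comma. A real system would feed these counts, along with the other features, into the trained model before thresholding.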

4
Features
  • N-gram and segmented word counts
  • Ratio of count of n-gram to (n-1)-gram
  • Transliterated characters
  • Prefixes and suffixes
  • Segmented words and dictionary
  • Mutual information
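Two of these features are easy to illustrate. The slides do not give exact formulas, so the sketch below makes assumptions: the count ratio divides by the count of the n-gram's leading (n-1)-gram, and mutual information is computed as pointwise mutual information for bigrams only. The function names and the toy text are hypothetical.

```python
import math
from collections import Counter


def char_ngram_counts(text, n):
    """Count all character n-grams of length n in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def count_ratio(ngram, counts_n, counts_prefix):
    """count(n-gram) / count(leading (n-1)-gram); 0.0 if the prefix is unseen."""
    denom = counts_prefix.get(ngram[:-1], 0)
    return counts_n.get(ngram, 0) / denom if denom else 0.0


def pmi(bigram, bigram_counts, unigram_counts):
    """Pointwise mutual information log(p(xy) / (p(x) p(y))) of a bigram."""
    p_xy = bigram_counts[bigram] / sum(bigram_counts.values())
    if p_xy == 0:
        return float("-inf")  # unseen bigram
    total_uni = sum(unigram_counts.values())
    p_x = unigram_counts[bigram[0]] / total_uni
    p_y = unigram_counts[bigram[1]] / total_uni
    return math.log(p_xy / (p_x * p_y))


# Hypothetical toy text with no punctuation.
text = "宝玉宝玉宝钗"
uni = char_ngram_counts(text, 1)
bi = char_ngram_counts(text, 2)
tri = char_ngram_counts(text, 3)
ratio = count_ratio("宝玉", bi, uni)  # 2 occurrences of 宝玉 / 3 of 宝
```

A high ratio means the last character almost always follows the prefix, and a high PMI means the two characters co-occur more often than chance would predict. Both are signals that the string behaves like a fixed unit such as a name.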

5
Thresholding
  • Otsu's method
      • Often used in image processing
      • Separates data into two classes, minimizing the
        variance within the classes
      • Does not depend on training data
  • F1 maximization
      • Find the threshold on training data that
        maximizes F1 score
      • Use same threshold on test data
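Both thresholding strategies can be sketched over a list of n-gram scores. This is a minimal pure-Python illustration, not the authors' implementation: it tries every distinct score as a cut rather than working from a histogram as image-processing versions of Otsu's method usually do, and the function names and example scores are hypothetical.

```python
def f1_score(predicted, gold):
    """F1 between a guessed name list and the correct name list."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def _variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def otsu_threshold(scores):
    """Cut that minimizes the weighted within-class variance.

    Items with score >= cut form the upper (name) class. Needs no labels.
    """
    ordered = sorted(scores)
    best_cut, best_var = None, float("inf")
    for i in range(1, len(ordered)):
        if ordered[i] == ordered[i - 1]:
            continue  # equal scores cannot be split
        low, high = ordered[:i], ordered[i:]
        within = (len(low) * _variance(low)
                  + len(high) * _variance(high)) / len(ordered)
        if within < best_var:
            best_cut, best_var = ordered[i], within
    return best_cut


def f1_max_threshold(scored, gold):
    """Try each distinct score as a cut; keep the one with the best F1."""
    best_cut, best_f1 = None, -1.0
    for cut in sorted(set(scored.values())):
        guessed = [g for g, s in scored.items() if s >= cut]
        f1 = f1_score(guessed, gold)
        if f1 > best_f1:
            best_cut, best_f1 = cut, f1
    return best_cut, best_f1


# Hypothetical training scores and correct names.
scored = {"宝玉": 0.92, "黛玉": 0.88, "笑道": 0.35, "来了": 0.15, "玉笑": 0.10}
gold = {"宝玉", "黛玉"}
otsu_cut = otsu_threshold(list(scored.values()))
f1_cut, f1_best = f1_max_threshold(scored, gold)
```

On this toy data both methods pick the same cut, 0.88, because the scores separate cleanly. The practical difference is that `otsu_threshold` can be run directly on test scores, while the cut from `f1_max_threshold` has to be learned on labeled training data and then reused.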

6
Results
  • No validation set, so we chose a baseline feature set
  • Ablation tests show that the baseline chosen was
    non-optimal
  • Best individual scores

7
Conclusion
  • Most useful features
      • N-gram counts / frequency ratios (0.46 F1 alone)
      • Varies depending on type of n-gram
  • Thresholding
      • Otsu's method yielded better overall performance
      • Both methods had drawbacks
  • Future work
      • More rigorous feature set testing
      • Larger / cleaner data sets