A Strategy for Automatic Corpus Generation - PowerPoint PPT Presentation


1
A Strategy for Automatic Corpus Generation
  • Alison Alvarez
  • Language Technologies Institute
  • Carnegie Mellon University

2
Research Background
  • If you encountered a new language, what question
    would you ask first?
  • Answer: Does it have a bilingual aligned corpus?
  • Our solution for minority languages: add one
    bilingual consultant and the Elicitation Tool

3
The Elicitation Tool

4
How We Represent Sentences
  • Main Components
  • Vector of Head Containing Phrases
  • Tree Structure
  • Surface Text
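
The three components listed above could be held together in one record. The following is a minimal sketch, not the presenter's implementation: the class names `Phrase`, `Tree`, and `Sentence`, and the feature names used in the example, are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Phrase:
    """One head-containing phrase: the head word plus the head's features."""
    head: str
    features: Dict[str, str]


@dataclass
class Tree:
    """A labelled node. Leaves carry a word; inner nodes carry children."""
    label: str
    children: List["Tree"] = field(default_factory=list)
    word: str = ""

    def leaves(self) -> List[str]:
        # Collect the words at the leaves, left to right.
        if not self.children:
            return [self.word] if self.word else []
        return [w for child in self.children for w in child.leaves()]


@dataclass
class Sentence:
    phrases: List[Phrase]  # vector of head-containing phrases
    tree: Tree             # tree structure

    @property
    def surface_text(self) -> str:
        # Surface text is read off the tree's leaves.
        return " ".join(self.tree.leaves())


# Hypothetical example built on the transitive shape S(NPSubj VP(VS NPObj)).
example = Sentence(
    phrases=[
        Phrase("she", {"person": "3", "number": "sg"}),
        Phrase("paints", {"tense": "present"}),
        Phrase("it", {"person": "3", "number": "sg"}),
    ],
    tree=Tree("S", [
        Tree("NPSubj", word="she"),
        Tree("VP", [Tree("VS", word="paints"), Tree("NPObj", word="it")]),
    ]),
)
```

Deriving the surface text from the tree, rather than storing it separately, is a design choice of this sketch; the slides list it as its own component.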

5
Feature Structures
  • What is a Feature?
  • What is a Feature Structure?
  • Example: one for the pronoun "she"
  • What are they used for?
  • Corpus Navigation
  • Sentence Generation
  • Feature Detection (Next Slide)
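
As a rough illustration of the ideas on this slide, a feature structure can be modelled as a flat mapping from feature names to values, with unification as the basic operation. This is a simplified sketch: the feature names and values for "she" are assumed, and real feature structures may be nested rather than flat.

```python
from typing import Dict, Optional

FeatureStructure = Dict[str, str]

# Hypothetical feature structure for the pronoun "she".
SHE: FeatureStructure = {
    "pos": "pronoun",
    "person": "3",
    "number": "sg",
    "gender": "fem",
    "animacy": "human",
}


def unify(a: FeatureStructure, b: FeatureStructure) -> Optional[FeatureStructure]:
    """Merge two flat feature structures; return None if any value conflicts."""
    merged = dict(a)
    for name, value in b.items():
        if name in merged and merged[name] != value:
            return None  # conflicting values: unification fails
        merged[name] = value
    return merged
```

For example, `unify(SHE, {"number": "sg"})` succeeds and returns a structure equal to `SHE`, while `unify(SHE, {"number": "pl"})` returns `None`.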

6
Feature Detection
  • Compare sentences with highly similar feature
    vectors
  • Note the differences in translation
  • Big problem: combinatorics
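
One way to read the comparison step is as a search for sentence pairs whose feature vectors differ in exactly one feature, so that any difference in their translations can be attributed to that feature. The sketch below assumes flat feature vectors and uses placeholder translation strings; the function names are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple

FeatureVector = Dict[str, str]


def differing_features(a: FeatureVector, b: FeatureVector) -> List[str]:
    """Names of the features whose values differ between a and b."""
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


def minimal_pairs(
    corpus: List[Tuple[FeatureVector, str]]
) -> List[Tuple[str, str, str]]:
    """Return (feature, translation_a, translation_b) for every pair of
    entries whose feature vectors differ in exactly one feature."""
    pairs = []
    for (fa, ta), (fb, tb) in combinations(corpus, 2):
        diff = differing_features(fa, fb)
        if len(diff) == 1:
            pairs.append((diff[0], ta, tb))
    return pairs


# Placeholder corpus: the translations are dummy labels, not real data.
corpus = [
    ({"person": "3", "number": "sg", "gender": "fem"}, "tr-1"),
    ({"person": "3", "number": "pl", "gender": "fem"}, "tr-2"),
    ({"person": "1", "number": "pl", "gender": "masc"}, "tr-3"),
]
```

Because every pair of sentences is compared, the work grows quadratically with corpus size, which is one face of the combinatorics problem the slide mentions.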

7
The Generation Process: Resources
  • Mini-Corpora of Phrasal Components
  • Includes all possible feature combinations
  • Lists Subcategorizations
  • A Phrase that will be put in the sentence
  • Uninflected for things outside of phrase
  • Example: a VP that could be inflected by a 3rd
    person singular subject
  • Inflected for things inside the phrase
  • Example: most plural nouns
  • A predicate field for the head
  • A Comment Field
  • Can itself be generated
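
A mini-corpus entry, and the enumeration of "all possible feature combinations", might look like the sketch below. The record fields follow the bullets above, but their names, and the feature names in the usage example, are assumptions.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, List


@dataclass
class MiniCorpusEntry:
    phrase: str               # the phrase that will be put in the sentence
    features: Dict[str, str]  # one combination of feature values
    subcat: str               # subcategorization label
    predicate: str            # predicate field for the head
    comment: str = ""         # free-text comment field


def all_feature_combinations(
    feature_values: Dict[str, List[str]]
) -> List[Dict[str, str]]:
    """Enumerate every combination of the given feature values."""
    names = sorted(feature_values)
    return [
        dict(zip(names, combo))
        for combo in product(*(feature_values[name] for name in names))
    ]
```

Calling `all_feature_combinations({"number": ["sg", "pl"], "definiteness": ["def", "indef"]})` yields four combinations; each added feature multiplies the total.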

8
The Generation Process: Resources
  • A set of tree structures with substitution sites
  • Each site has a list of selectional restrictions
  • A transitive sentence
  • S(NPSubj VP(VS NPObj))
  • VS = verb sequence
  • Also includes a subcategorization description
  • The subcategorization will be used so that the
    right trees get put with the right verbs, etc.
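
To make the bracketed notation concrete, here is a small sketch that parses a template such as `S(NPSubj VP(VS NPObj))` into a nested structure and lists its substitution sites (the leaves). The parser and the `(label, children)` tuple representation are assumptions for illustration; it expects well-formed input and does no error recovery.

```python
import re
from typing import List, Tuple

Node = Tuple[str, list]  # (label, children)


def parse_tree(text: str) -> Node:
    """Parse bracketed notation like 'S(NPSubj VP(VS NPObj))'."""
    tokens = re.findall(r"[A-Za-z]+|[()]", text)
    pos = 0

    def node() -> Node:
        nonlocal pos
        label = tokens[pos]
        pos += 1
        children = []
        if pos < len(tokens) and tokens[pos] == "(":
            pos += 1  # consume "("
            while tokens[pos] != ")":
                children.append(node())
            pos += 1  # consume ")"
        return (label, children)

    return node()


def substitution_sites(tree: Node) -> List[str]:
    """Leaves of the template are the sites where phrases get substituted."""
    label, children = tree
    if not children:
        return [label]
    return [site for child in children for site in substitution_sites(child)]


transitive = parse_tree("S(NPSubj VP(VS NPObj))")
```

Here `substitution_sites(transitive)` gives `["NPSubj", "VS", "NPObj"]`. A fuller version would attach a subcategorization label and per-site selectional restrictions to each template.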

9
The Generation Process: Resources
  • The Lexicon
  • Most of the lexical description will come
    from the feature vector
  • Will simply list selectional restrictions
  • Example for "to paint":
  • Subject: Human, Object: Inanimate
  • Can vary based on tree structure if needed
  • Example for "to be":
  • Subject Gender = Object Gender, Subject Number =
    Object Number, Subject Animacy = Object
    Animacy
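
The two lexicon examples can be sketched as data plus a checking function. This assumes that the "to be" entry means the subject and object must agree on gender, number, and animacy; that reading, the dictionary layout, and the feature values are assumptions rather than details confirmed by the slides.

```python
from typing import Dict

Features = Dict[str, str]

# Fixed-value restrictions per argument, as in the "to paint" example.
LEXICON = {
    "paint": {
        "subject": {"animacy": "human"},
        "object": {"animacy": "inanimate"},
    },
}

# Features on which subject and object must agree (assumed reading of "to be").
AGREEMENT = {"be": ["gender", "number", "animacy"]}


def satisfies(features: Features, restriction: Features) -> bool:
    """True if every restricted feature has the required value."""
    return all(features.get(k) == v for k, v in restriction.items())


def allowed(verb: str, subject: Features, obj: Features) -> bool:
    """Check a verb's selectional restrictions against its arguments."""
    entry = LEXICON.get(verb, {})
    if not satisfies(subject, entry.get("subject", {})):
        return False
    if not satisfies(obj, entry.get("object", {})):
        return False
    return all(subject.get(f) == obj.get(f) for f in AGREEMENT.get(verb, []))


she = {"gender": "fem", "number": "sg", "animacy": "human"}
mother = {"gender": "fem", "number": "sg", "animacy": "human"}
house = {"number": "sg", "animacy": "inanimate"}
```

With these entries, `allowed("paint", she, house)` and `allowed("be", she, mother)` hold, while `allowed("paint", house, she)` and `allowed("be", she, house)` do not.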

10
The Generation Process: Resources
  • A collection of word inflections
  • the original uninflected word, not always in
    infinitive form
  • its inflections
  • And their corresponding circumstances
  • Organized by part of speech
  • Examples
  • Have: Subject 3rd Singular -> has
  • Is: Subj 1st sg -> am, Subj pl -> are, Subj 2nd
    -> are
  • Mouse: pl -> mice
  • Most nouns will already be inflected
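
The inflection collection described above can be sketched as a table keyed by part of speech and uninflected word, where each inflected form is paired with the circumstances that trigger it. The entries come from the slide's examples; the key names such as `subj_person` and the first-match lookup rule are assumptions.

```python
from typing import Dict

# part of speech -> uninflected word -> list of (circumstances, inflected form)
INFLECTIONS = {
    "verb": {
        "have": [({"subj_person": "3", "subj_number": "sg"}, "has")],
        "is": [
            ({"subj_person": "1", "subj_number": "sg"}, "am"),
            ({"subj_number": "pl"}, "are"),
            ({"subj_person": "2"}, "are"),
        ],
    },
    "noun": {
        "mouse": [({"number": "pl"}, "mice")],
    },
}


def inflect(pos: str, word: str, circumstances: Dict[str, str]) -> str:
    """Return the first inflection whose circumstances all match;
    fall back to the uninflected word when nothing matches."""
    for required, form in INFLECTIONS.get(pos, {}).get(word, []):
        if all(circumstances.get(k) == v for k, v in required.items()):
            return form
    return word
```

For instance, `inflect("verb", "have", {"subj_person": "3", "subj_number": "sg"})` returns `"has"`, and a third person singular subject leaves `"is"` unchanged.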

11
The Generation Process: Algorithm
  • Start with the Mini-Corpus for Verb Sequences
  • For each verb, select the appropriate trees
  • Replace the VS tag with the uninflected verb if
    it is not restricted
  • For each partially filled in tree
  • Fill in the appropriate slots with words that can
    unify
  • Eliminate all that don't meet the selectional
    restrictions of their mother node
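
The first half of the algorithm could be sketched roughly as follows. Trees are flattened to their slot lists for brevity, and the data layout, names, and feature values are all assumptions; the sketch filters candidates by selectional restrictions before combining them, which gives the same result as filling first and eliminating afterwards.

```python
from itertools import product
from typing import Dict, List

# Slots of S(NPSubj VP(VS NPObj)), flattened in surface order.
TREES = {"transitive": ["NPSubj", "VS", "NPObj"]}

# Hypothetical verb-sequence mini-corpus with per-slot restrictions.
VERBS = {
    "paint": {
        "subcat": "transitive",
        "restrictions": {
            "NPSubj": {"animacy": "human"},
            "NPObj": {"animacy": "inanimate"},
        },
    },
}

# Hypothetical noun-phrase mini-corpus.
NOUN_PHRASES = {
    "she": {"animacy": "human"},
    "it": {"animacy": "inanimate"},
}


def meets(features: Dict[str, str], restriction: Dict[str, str]) -> bool:
    return all(features.get(k) == v for k, v in restriction.items())


def fill_trees(verbs, trees, phrases) -> List[List[str]]:
    """For each verb, pick its tree, put the uninflected verb in the VS slot,
    and fill every other slot with each phrase that meets its restriction."""
    filled = []
    for verb, entry in verbs.items():
        options = []
        for slot in trees[entry["subcat"]]:
            if slot == "VS":
                options.append([verb])
            else:
                restriction = entry["restrictions"].get(slot, {})
                options.append(
                    [p for p, feats in phrases.items() if meets(feats, restriction)]
                )
        filled.extend(list(combo) for combo in product(*options))
    return filled
```

With the data above, `fill_trees(VERBS, TREES, NOUN_PHRASES)` produces the single filled tree `["she", "paint", "it"]`, with the verb still uninflected.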

12
The Generation Process: Algorithm
  • Once all of the Phrases are filled in
  • Create a new sentence structure
  • Create full phrasal vector
  • Concatenate phrases together for surface text
  • Inflect everything that needs to be inflected
  • Match up head words with words from head word
    collection
  • Only checks for verb sequence heads in current
    implementation
  • Check for matching circumstances
  • Replace old word with inflected word
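
The final steps, concatenating the filled slots and inflecting the verb sequence head, might be sketched as below. Like the implementation described on the slide, it only inflects verb sequence heads. The inflection entry for "paint" and all names are assumptions made for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical head-word collection: verb -> list of (circumstances, form).
VERB_INFLECTIONS = {
    "paint": [({"person": "3", "number": "sg"}, "paints")],
}


def inflect_verb(verb: str, subject_features: Dict[str, str]) -> str:
    """Replace the verb with the first inflection whose circumstances match."""
    for required, form in VERB_INFLECTIONS.get(verb, []):
        if all(subject_features.get(k) == v for k, v in required.items()):
            return form
    return verb


def surface_text(
    filled_slots: List[Tuple[str, str]], subject_features: Dict[str, str]
) -> str:
    """Concatenate the slot fillers in order, inflecting only the VS slot."""
    words = [
        inflect_verb(word, subject_features) if slot == "VS" else word
        for slot, word in filled_slots
    ]
    text = " ".join(words)
    return text[0].upper() + text[1:]


slots = [("NPSubj", "she"), ("VS", "paint"), ("NPObj", "it")]
sentence = surface_text(slots, {"person": "3", "number": "sg"})
```

Here `sentence` is `"She paints it"`; with a first person singular subject the verb stays as `"paint"`.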

13
Examples
  • She paints it
  • She is a mother
  • She is an idea
  • Paint the house
  • I think that she likes Bill
  • I know to run
  • The implementation

14
Current Problems/Future Plans
  • How to do long distance attachment
  • Fluency/Lexical Matching: "She went up a ladder
    and up an aardvark"
  • Don't put in so many words
  • Write a lot of rules
  • Make sure every combination is covered despite
    selectional restrictions
  • Expand the Implementation