Linguistics 187/287 Week 7

1
Linguistics 187/287 Week 7
FSTs and XLE Grammars Generation
  • Ron Kaplan and Tracy King

2
Regular Relations and Morphological Analysis
3
Morphology: The structure of words
  • Words have parts, parts code meaning
  • Inflections add agreement features
  • walked = walk + ed
  • (move on foot) + Past
  • Derivations: affixes change core meaning
  • intractable = in + tractable
  • not + possible
  • Compounds: make new words from old
  • lighthouse, grasshopper
  • What are the properties of the coding system?
  • How can people/computers produce/decode words?

4
Characterizing words
  • English inflection
  • Only 4 forms for (most) verbs, 2 for nouns
  • walk, walks, walked, walking; girl, girls
  • Make a list (dictionary)
  • English derivation
  • Suffixes and prefixes are promiscuous
  • unredecontaminatability
  • A looonnng list
  • Other languages are worse
  • Spanish, French, Italian: richer inflection, pronoun attachments
  • 98 forms for some French verbs, 300 forms for some Spanish verbs
  • Finnish: much richer, 18,000 forms for some verbs!
  • German, Swedish: productive compounding gives infinitely many words!
  • Lebensversicherungsgesellschaftsangestellter

Finite lists are impractical, impossible
Cannot characterize infinite sets
5
Word formation
  • Parts combine selectively

terror + ize → terrorize
lamp + s → lamps
walk + able + ity → walkability
6
Morphological alternations
  • Sounds/spelling change when parts combine

fake + ing → faking (drop silent e)
stop + ing → stopping (double consonant)
  • Changes are systematic
  • baking, driving, dropping, ripping, lying, dying
  • Appear with newly coined words
  • zake → zaking, zop → zopping, zie → zying

Changes governed by general rules, not lists of particular cases
7
Rules for morphological alternations
  • Context-sensitive rewriting

fake + ing → faking
Delete e, but only in the context of a vowel suffix
  • A linguistic notation: e → ε / _ + Vowel
  • General rule formalism: φ → ψ / λ _ ρ
  • Change a string φ to ψ
  • but only when it appears between strings λ and ρ

8
Rules apply in sequence
stop + ing
↓
ε → p / p _ + Vowel
e → ε / _ + Vowel
i → y / _ + i
+ → ε
9
Rule order matters
ε → p / p _ + Vowel
e → ε / _ + Vowel
i → y / _ + i
+ → ε
10
Elegant descriptions, but interpretation is
  • Complicated
  • Scan input string for an instance of φ
  • Look back to see if it is preceded by λ
  • Look ahead to see if it is followed by ρ
  • If so, replace φ with ψ
  • Feed result to next rule

φ → ψ / λ _ ρ
  • Counterintuitive
  • Most rules don't change most inputs
  • But all rules must be attempted--wasted effort
  • Asymmetric
  • Easy to produce words from parts
  • Decoding words into parts is much harder
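The scan / look back / look ahead / replace procedure above can be sketched with regular-expression lookarounds. This is a minimal illustration, not the lecture's implementation: it assumes "+" marks the morpheme boundary, takes Vowel to be the letters a, e, i, o, u, and uses the hypothetical names RULES, apply_rule and cascade.

```python
import re

VOWEL = "[aeiou]"  # assumed vowel class for this sketch

# Each rule is (phi, psi, left, right): rewrite phi as psi between left and right.
# phi and left are literal strings; right is a regex fragment.
RULES = [
    ("",  "p", "p", r"\+" + VOWEL),  # ε → p / p _ + Vowel  (double consonant)
    ("e", "",  "",  r"\+" + VOWEL),  # e → ε / _ + Vowel    (drop silent e)
    ("i", "y", "",  r"\+i"),         # i → y / _ + i
    ("+", "",  "",  ""),             # + → ε                (erase the boundary)
]

def apply_rule(rule, s):
    """Replace every phi that is preceded by left and followed by right."""
    phi, psi, left, right = rule
    pattern = ""
    if left:
        pattern += "(?<=" + re.escape(left) + ")"   # look back
    pattern += re.escape(phi)                       # the string to change
    if right:
        pattern += "(?=" + right + ")"              # look ahead
    return re.sub(pattern, lambda m: psi, s)

def cascade(rules, s):
    """Feed the output of each rule to the next one."""
    for rule in rules:
        s = apply_rule(rule, s)
    return s

for word in ["stop+ing", "fake+ing", "tie+ing", "hope+ing"]:
    print(word, "->", cascade(RULES, word))

# Swapping the first two rules changes the result for "hope+ing".
reordered = [RULES[1], RULES[0], RULES[2], RULES[3]]
print(cascade(reordered, "hope+ing"))
```

Every rule is attempted on every input even when it changes nothing, and the last call shows why order matters: with e-deletion first, the doubling rule sees "hop+ing" and yields "hopping" instead of "hoping".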

11
Decoding is harder
ε → p / p _ + Vowel
e → ε / _ + Vowel
i → y / _ + i
+ → ε
12
Mathematical analysis: Rules as relations
  • Effect of any rule R can be modeled by a
    relation Rel(R): the (infinite) set of R's
    input/output pairs
  • Rel(e → ε / _ + Vowel) =
    { <fake+ing, fak+ing>, <fakes, fakes>, <stopped, stopped>,
      <walked, walked>, ..., <xyz, xyz>, <qrs, qrs>, ... }
  • Theorem: For any rewriting rule R, Rel(R) is
    a regular relation

13
Regular Relations and Finite-State Transducers
  • Defining properties of regular relations
  • {<x1, x2>} is a regular relation, for x1, x2 in
    {ε, a, b, c, ...}
  • Suppose Rel1 and Rel2 are regular relations.
    Then
  • Concatenation
  • {<x1x2, y1y2> | <x1,y1> ∈ Rel1, <x2,y2> ∈ Rel2}
    is regular
  • Union
  • {<x,y> | <x,y> ∈ Rel1 or <x,y> ∈ Rel2}
    is regular
  • Arbitrary repetition
  • {<xxxx..., yyyy...> | <x,y> ∈ Rel1} is regular
  • Regular relations are computed by finite-state
    transducers

Final state
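The three closure operations can be tried out on small finite relations represented as Python sets of (input, output) pairs. This is only a sketch: real regular relations may be infinite, so the repetition here is cut off at a bound, and the function names are illustrative.

```python
from itertools import product

def concat(rel1, rel2):
    """Pair-wise concatenation: <x1x2, y1y2> for <x1,y1> in rel1, <x2,y2> in rel2."""
    return {(x1 + x2, y1 + y2) for (x1, y1), (x2, y2) in product(rel1, rel2)}

def union(rel1, rel2):
    """All pairs that are in rel1 or in rel2."""
    return rel1 | rel2

def repeat(rel, max_n):
    """Pairs built from 0..max_n repetitions (a finite slice of the closure)."""
    result = {("", "")}
    layer = {("", "")}
    for _ in range(max_n):
        layer = concat(layer, rel)
        result |= layer
    return result

stems = {("walk", "walk"), ("fake", "fak")}
suffix = {("+ing", "ing")}

print(sorted(concat(stems, suffix)))
print(sorted(union(stems, suffix)))
print(sorted(repeat({("a", "b")}, 3)))
```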
16
Rule transducers
(convention: write x:y for <x,y>)
a:a b:b c:c ...
  • FSTs can be created automatically from rules
  • FSTs may be complex, interpretation is simple
  • Move from state to state, matching input and
    producing output
  • Context requirements enforced by
    states/transitions
  • Symmetry of producing, decoding
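A small interpreter shows how simple FST interpretation is and why producing and decoding are symmetric: the same arcs are walked in either direction, only the roles of the two sides swap. This sketch assumes an acyclic transducer built from a hand-written list of pairs; linear_fst, apply_fst, the "+" tag prefix and the toy lexicon are illustrative choices, not part of any XLE or xfst API.

```python
EPS = ""  # epsilon: consumes or emits nothing

def linear_fst(pairs):
    """Build a transducer with one linear path per (upper, lower) pair.

    Each side is a sequence of symbols; the shorter side is padded with EPS.
    Returns (arcs, start, finals), where arcs maps state -> [(up, low, next)].
    """
    arcs, finals, next_state = {0: []}, set(), 1
    for upper, lower in pairs:
        state = 0
        for i in range(max(len(upper), len(lower))):
            up = upper[i] if i < len(upper) else EPS
            low = lower[i] if i < len(lower) else EPS
            arcs[state].append((up, low, next_state))
            arcs[next_state] = []
            state = next_state
            next_state += 1
        finals.add(state)
    return arcs, 0, finals

def apply_fst(fst, symbols, direction="down"):
    """Move from state to state, matching input symbols and collecting output."""
    arcs, start, finals = fst
    results = set()
    stack = [(start, 0, ())]
    while stack:
        state, pos, out = stack.pop()
        if state in finals and pos == len(symbols):
            results.add("".join(out))
        for up, low, nxt in arcs[state]:
            src, dst = (up, low) if direction == "down" else (low, up)
            emitted = out + (dst,) if dst != EPS else out
            if src == EPS:
                stack.append((nxt, pos, emitted))        # nothing to consume
            elif pos < len(symbols) and symbols[pos] == src:
                stack.append((nxt, pos + 1, emitted))    # consume one symbol
    return results

LEXICON = [
    (["g", "o", "+Verb", "+PastTense", "+123SP"], list("went")),
    (["b", "o", "x", "+Noun", "+Pl"], list("boxes")),
    (["b", "o", "x", "+Verb", "+Pres", "+3sg"], list("boxes")),
]
fst = linear_fst(LEXICON)

print(apply_fst(fst, list("boxes"), "up"))     # decoding: surface form to analyses
print(apply_fst(fst, ["g", "o", "+Verb", "+PastTense", "+123SP"], "down"))  # producing
```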

17
Applying in sequence
Output of one rule/FST is input to next
tie+ing
FST1: ε → p / p _ + Vowel
tie+ing
FST2: e → ε / _ + Vowel
ti+ing
FST3: i → y / _ + i
ty+ing
FST4: + → ε
tying
18
Composing transducers
  • Theorem
  • Let R be the relation computed by feeding the
    output of transducer FST1 as input to FST2.
    Then there is another transducer FST3 that
    computes R in a single step.

[Figure: input → FST1 → FST2 → output computes R; FST3 maps input to output in a single step]
  • Corollary The effect of composing any finite
    sequence of FSTs can be modeled by a single one
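The theorem can be illustrated with the product construction, restricted here to transducers without epsilon arcs (the general case needs extra care with epsilons). An arc a:b in the first machine pairs with an arc b:c in the second to give a:c. The data layout and the names compose and transduce are assumptions of this sketch.

```python
def compose(fst1, fst2):
    """Product construction for epsilon-free transducers.

    Each fst is (arcs, start, finals); arcs maps state -> [(inp, out, next)].
    """
    arcs1, start1, finals1 = fst1
    arcs2, start2, finals2 = fst2
    start = (start1, start2)
    arcs, finals = {}, set()
    todo = [start]
    while todo:
        q = todo.pop()
        if q in arcs:
            continue  # pair state already expanded
        q1, q2 = q
        arcs[q] = []
        if q1 in finals1 and q2 in finals2:
            finals.add(q)
        for a, b, n1 in arcs1.get(q1, []):
            for b2, c, n2 in arcs2.get(q2, []):
                if b == b2:  # output of fst1 must match input of fst2
                    arcs[q].append((a, c, (n1, n2)))
                    todo.append((n1, n2))
    return arcs, start, finals

def transduce(fst, text):
    """Return every output the transducer produces for the input string."""
    arcs, start, finals = fst
    results = set()
    stack = [(start, 0, "")]
    while stack:
        state, pos, out = stack.pop()
        if pos == len(text):
            if state in finals:
                results.add(out)
            continue
        for inp, outp, nxt in arcs.get(state, []):
            if inp == text[pos]:
                stack.append((nxt, pos + 1, out + outp))
    return results

# FST1 rewrites a as b; FST2 rewrites b as c. Both have a single, final state.
fst1 = ({0: [("a", "b", 0), ("b", "b", 0), ("c", "c", 0)]}, 0, {0})
fst2 = ({0: [("a", "a", 0), ("b", "c", 0), ("c", "c", 0)]}, 0, {0})
fst3 = compose(fst1, fst2)

two_steps = {z for y in transduce(fst1, "abc") for z in transduce(fst2, y)}
print(two_steps, transduce(fst3, "abc"))
```

The single machine fst3 gives the same answer as running fst1 and then fst2, without building the intermediate string.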

19
Example
e → ε / _ + Vowel
[Figure: state diagram of the transducer for this rule]
20
Summary
  • Rewriting rules can give elegant descriptions of
    morphological alternations
  • Rules are difficult to interpret, inefficient for
    decoding
  • Mathematical properties
  • Every rule denotes a regular relation, has an
    equivalent FST
  • There is a single FST equivalent to every
    FST/rule sequence
  • Finite-state techniques bridge between
  • Elegant linguistic description
  • Efficient, intuitive computation
  • Finite-state techniques bridge between
  • Theory
  • Practice
  • Morphological recognition and generation, spell
    checkers, search engines, indexing,
    character/handwriting recognition

21
FSTs in XLE grammars
  • FSTs are used for
  • tokenization
  • morphological analysis
  • Incorporated via the MORPHCONFIG

22
FST Morphologies
  • Associate a surface form of a word with a
    canonical form (lemma, stem) and a set of tags
  • Tags give grammatical information
  • Part of speech
  • Other information (number, tense, etc)
  • Tags may give additional information
  • Classes of proper nouns (names, locations)

23
Examples English
  • went go "Verb" "PastTense" "123SP"
  • boxes box "Noun" "Pl" | box "Verb" "Pres" "3sg"
  • Mary "Prop" "Giv" "Fem" "Sg"
  • him he "Pron" "Pers" "Acc" "3P" "Sg"

24
Examples French
  • fleur fleur "Fem" "SG" "Noun"
  • venir venir "Inf" "Verb"
  • vienne venir "SubjP" "SG" "P1"|"P3" "Verb"
  • tour tour "Masc"|"Fem" "SG" "Noun"
  • France France "Fem" "InvPL" "Country" "Proper" "Noun"

25
Tokenization
  • Tokenization breaks up a string (sentence) into
    tokens (words)
  • Break off punctuation
  • Break off clitics
  • Lowercasing
  • Allow for markup

26
Punctuation and clitics
  • Simple breaking
  • I see them. => I TB see TB them TB . TB
  • The dog, a poodle, => The TB dog TB , TB a TB poodle TB , TB
  • Haplology
  • Find the dog, Muffy. => Find TB the TB dog TB , TB Muffy TB , TB . TB
  • Go to Palm Dr. => Go TB to TB Palm TB Dr. TB . TB
  • Clitics
  • I'll go. => I TB 'll TB go TB . TB
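Simple breaking and clitic splitting can be approximated with one regular expression. This is a rough sketch, not the XLE tokenizer FST: it is deterministic, handles only the clitic 'll, and does not model haplology or abbreviations such as "Dr.".

```python
import re

# Try the clitic first, then a run of word characters, then a single
# punctuation character.
TOKEN = re.compile(r"'ll|\w+|[^\w\s]")

def tokenize(text):
    """Break off punctuation and the clitic 'll; follow each token with TB."""
    return "".join(tok + " TB " for tok in TOKEN.findall(text)).strip()

print(tokenize("I see them."))
print(tokenize("The dog, a poodle,"))
print(tokenize("I'll go."))
```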

27
Punctuation Problems
  • When to break off punctuation is not always clear
  • Hyphens part of the word or separate
    punctuation?
  • a six-year-old boy
  • a windshield-wiper blade cleaner
  • The dog - a poodle - barked.

28
Lowercasing
  • Need to (optionally) lowercase in certain
    positions (depends on the language)
  • Sentence initially
  • The boy left. => the boy left.
  • Mary left. => Mary left.
  • After colons
  • The boy left: He was unhappy. => The boy left:
    he was unhappy.
  • All caps
  • Do NOT leave. => do not leave.
  • IBM did well. => IBM did well.

29
Tokenizers are non-deterministic
  • Allow for multiple tokenizations to guarantee
    correct one
  • Bush saw them. => {Bush | bush} TB saw TB them
    TB (, TB) . TB
  • May include markup
  • All caps lowering marked
  • IBM => {IBM | ibm}

30
An example
  • String: Children came.
  • Tokens: {Children | children} TB came TB (, TB) . TB
  • Morphology (for the tokens we want)
  • child Noun Pl | children Token
  • come Verb PastTense 123SP | came Token
  • . Punct Sent | . Token
  • Outputs from tokenizer and morphology fsts can
    multiply out

31
Allowing for markup
  • Normal rules of tokenization (lowercasing,
    haplology) need to skip markup
  • The markup should not be broken up like regular
    punctuation
  • labeled bracketing
  • I see [NP the dog, a poodle].
  • named entities
  • <person>Mr. Smith</person>

32
The process in XLE
XLE words
words
33
Viewing the analysis in XLE
  • If a FST tokenizer is loaded with the grammar
  • tokens I'll try this string.
  • If a FST morphology is loaded with the grammar
  • morphemes testing
  • These results are also visible in the morph
    window (from the c-structure window options)

34
Using FSTs with the grammar
  • Tokenize the string
  • Children came. => children TB came TB . TB
  • Run the tokens through the morphology
  • child Noun Pl come Verb PastTense 123SP .
    Punct Sent
  • Parse the lemmas and the tags
  • sublexical rules build up the words
  • regular rules build the words into phrases
  • each tag has a lexical entry

35
Lexical entries for stems and tags
  • Like the lexical entries you have seen, only with
    XLE instead of *
  • boy N XLE @(NOUN boy).
  • Noun N_POS_SFX XLE @(PERS 3).
  • Sg N_SFX XLE @(NUM sg).
  • Pl N_SFX XLE @(NUM pl).
  • Note: no entry for boys
  • * matches tokens that don't go through the FST; XLE
    matches FST output stems

36
Sublexical rules
  • Want to insert rules between the lexical
    categories (e.g. N) and the same category in the
    lexicon
  • But the lexical category only identifies the stem
    or base
  • Sublexical rules combine the base with the
    inflectional tags
  • So, build a category (N) from the base (N_BASE)

37
Sublexical rules cont.
  • Like lexical rules only
  • Add _BASE to the category in the lexicon
  • boy N Noun N_POS_SFX Sg N_SFX
  • Example
  • N --> N_BASE
  • N_POS_SFX_BASE
  • N_SFX_BASE.
  • When parsing, the sublexical trees are not shown.
    Right click on the leaf node (e.g., N) and
    choose "show morphemes" to see them.

38
NP example tree
39
Sublexical rules cont.
  • A --> A_BASE
  • A_POS_SFX_BASE
  • (A_SFX_BASE). "optionality"
  • N --> { N_BASE "disjunction"
  • N_POS_SFX_BASE
  • N_SFX_BASE
  • | VN_BASE
  • V_POS_SFX_BASE
  • V_SFX_BASE* }. "kleene star"

40
Using the -unknown entry
  • Words with predictable subcat frames can go
    through the special entry -unknown
  • The tags will constrain the distribution
  • This avoids having to list all adverbs,
    adjectives, nouns, etc.
  • %stem picks up the lemma/stem
  • -unknown ADJ XLE @(ADJ %stem)
  • N XLE @(NOUN %stem)
  • ADV XLE @(ADVERB %stem).

41
Lexicon and -unknown
  • Verbs ought to be listed due to their subcat
    frames
  • Idiosyncratic entries for nouns, etc. need to be
    listed
  • But, avoid duplicating the work done by the FST
    morphology in the lexicon: mapping to categories
    is done in only one place

42
FST guessers
  • The morphologies are good, but don't have all
    words
  • FST guessers can be written
  • work best for languages with lots of morphology
  • for English
  • -ed can be a verb or adjective
  • -ing can be a verb, noun, or adjective
  • -s can be a plural noun or 3sg verb
  • words starting with capitals can be proper nouns
  • etc.

43
Using multiple FSTs
  • How FSTs are used is declared in the MORPHCONFIG
  • The toy grammars use a default MORPHCONFIG
  • TOKENIZE and ANALYZE sections
  • Sections to specify
  • where the fsts are
  • how to treat multiword expressions

44
Example MORPHCONFIG
  • STANDARD ENGLISH MORPHOLOGY (1.0)
  • TOKENIZE
  • whitespace.fst tokenizer.fst
  • ANALYZE USEFIRST
  • main-morphology.fst
  • english-guesser.fst
  • ANALYZE USEALL
  • eureka-numbers.fst
  • eureka-novel-nouns.fst
  • ----

45
Morphconfig cont.
  • TOKENIZE
  • whitespace.fst tokenizer.fst
  • The fsts listed are composed: output of first is
    input to second, etc.
  • Having multiple fsts
  • may avoid problems with large
    compositions
  • allows for modularity

46
Morphconfig cont.
  • ANALYZE USEFIRST
  • main-morphology.fst
  • english-guesser.fst
  • Take as input the individual tokens from the
    tokenizer
  • Apply the analyzers one by one until an analysis
    is found. Once an analysis is found, it stops.
  • Effect of the above example
  • first try to find the analysis in the main
    morphology
  • if that fails, guess the morphological analysis

47
Morphconfig cont.
  • ANALYZE USEALL
  • eureka-numbers.fst
  • eureka-novel-nouns.fst
  • Each morphological analyzer is applied to the
    string, produces union of results
  • In the example, if a string could be both a
    eureka number and a eureka novel noun, it will
    get both analyses
  • It is not necessary to have both USEALL and
    USEFIRST sections.
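The difference between the two ANALYZE modes can be sketched with analyzers modelled as plain functions from a token to a set of analyses. The toy analyzers, their tag strings and the function names are stand-ins for illustration; they are not XLE code, and how XLE combines the sections internally is not shown here.

```python
def use_first(analyzers, token):
    """USEFIRST: try the analyzers in order and stop at the first that succeeds."""
    for analyze in analyzers:
        results = analyze(token)
        if results:
            return results
    return set()

def use_all(analyzers, token):
    """USEALL: apply every analyzer and return the union of the results."""
    results = set()
    for analyze in analyzers:
        results |= analyze(token)
    return results

# Toy stand-ins for main-morphology.fst and english-guesser.fst.
def main_morphology(token):
    table = {"boxes": {"box +Noun +Pl", "box +Verb +Pres +3sg"}}
    return table.get(token, set())

def guesser(token):
    # Guess that an unknown word ending in -s is a plural noun.
    return {token[:-1] + " +Noun +Pl"} if token.endswith("s") else set()

print(use_first([main_morphology, guesser], "boxes"))  # main morphology wins
print(use_first([main_morphology, guesser], "zops"))   # falls back to the guesser
print(use_all([main_morphology, guesser], "boxes"))    # union of both analyzers
```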

48
FST/XLE main points
  • XLE allows the incorporation of FSTS through the
    MORPHCONFIG
  • Tokenizers, including special markup, and
    morphological analyzers can be included
  • Large morphological analyzers in conjunction with
    sublexical rules and the unknown lexical item
    reduce the need for lexicon development

49
Integrating Shallow Mark-up: part-of-speech tags, named entities, syntactic brackets
50
Shallow mark-up of input strings
  • Part-of-speech tags (tagger?)
  • I/PRP saw/VBD her/PRP duck/VB.
  • I/PRP saw/VBD her/PRP$ duck/NN.
  • Named entities (named-entity recognizer)
  • <person>General Mills</person> bought it.
  • <company>General Mills</company> bought it.
  • Syntactic brackets (chunk parser?)
  • [NP-S I] saw [NP-O the girl with the telescope].
  • [NP-S I] saw [NP-O the girl] with the telescope.

51
Hypothesis
  • Shallow mark-up
  • Reduces ambiguity
  • Increases speed
  • Without decreasing accuracy
  • (Helps development)
  • Issues
  • Markup errors may eliminate correct analyses
  • Markup process may be slow
  • Markup may interfere with existing robustness
    mechanisms (optimality, fragments, guessers)
  • Backoff may restore robustness but decrease speed
    in 2-pass system (STOPPOINT)

52
Implementation in XLE
How to integrate with minimal changes to existing
system/grammar?
53
XLE String Processing
lexical forms
Multiwords: modify sequences
token morphemes
Analyze: Morph, Guess, Tok
tokens: {T|t}he TB oil TB filter TB 's TB gone TB
Tokenize: decap, split, commas
string: The oil filter's gone
54
Part of speech tags
lexical forms
Multiwords
token morphemes
Analyze
  • How do tags pass thru Tokenize/Analyze?
  • Which tags constrain which morphemes?
  • How?

tokens
Tokenize
string
The/DET_ oil/NN_ filter/NN_'s/VBZ_ gone/VBN_
55
Passing tags through Tokenizer
  • Tokenizer must treat tag characters specially
  • Must recognize them, e.g. xxx/TAG_
  • Must not transform them, e.g. x/NN_ → x/nn_
  • Must not let tags interrupt other patterns
  • e.g. wo/MD_n't/RB_ should behave like
    won't
  • Must split tags off as separate tokens, for
    existing Token path through Analyzer
  • How to do this with minimal changes to existing
    tokenizer FST?

tokens
Tokenize
string
56
Modifying an existing tokenizer
  • Tags shouldn't be transformed
  • Tags shouldn't disrupt any other patterns

Script for xfst program: Tokenizer Tag
.o. Tokenizer/Tag
Don't transform
Don't disrupt
Glitch: Ignore (/) introduces unwanted ambiguity
around insertions
Solution, with a little less modularity: construct the
Tokenizer using a cover symbol for tags, placing
them with respect to insertions; then substitute actual
tag-strings for the cover symbol
57
Specifying morpheme/pos-tag constraints
  • For each pos-tag, grammar/morphology writer
    specifies by hand the set of compatible morph-tag
    sequences
  • Inputs Description of pos-tag interpretation (
    from Penn document)
  • List of all possible morph-tag sequences from
    analyzer (from program run on Morph/Guesser
    FSTs)
  • Output A text file that characterizes the
    relationship
  • E.g. NNS is plural noun, so text file has
  • (NNS ( Noun Pl) (Noun SP) ( Abbr) )
  • PRP is personal pronoun, so text file has
  • (PRP ( Pron Pers Gen) (Pron Poss) )
  • Lisp program reads file, produces POSFilter
    transducer
  • Allows NNS_ Token sequence only if preceded by
    strings that contain
  • Noun and PL tags, or Noun and SP tags, etc.
  • POSFilter FST is put in MULTIWORD section,
    knocks out undesired morpheme sequences.
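The filtering idea can be sketched in a few lines: each POS tag maps to the alternative sets of morphological tags it is compatible with, and an analysis survives if it contains all the tags of at least one alternative. The table below copies the NNS example from the slide; the data layout, the function name and the treatment of unlisted POS tags are assumptions of this sketch, not the Lisp program or the POSFilter transducer itself.

```python
# Compatible morph-tag sets per POS tag, following the NNS example above.
POS_COMPAT = {
    "NNS": [{"Noun", "Pl"}, {"Noun", "SP"}, {"Abbr"}],
}

def pos_filter(pos_tag, analyses):
    """Keep the (lemma, tags) analyses that are compatible with pos_tag."""
    allowed = POS_COMPAT.get(pos_tag)
    if allowed is None:
        return list(analyses)  # assumption: an unlisted POS tag imposes no constraint
    return [(lemma, tags) for lemma, tags in analyses
            if any(required <= set(tags) for required in allowed)]

analyses = [
    ("box", ("Noun", "Pl")),
    ("box", ("Verb", "Pres", "3sg")),
]
print(pos_filter("NNS", analyses))  # only the plural-noun reading remains
```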

58
All together
lexical forms
Multiwords
POSFilterFST
token morphemes
Analyze
tokens
Tokenize
Tokenize
POSStringFST
string
59
MORPHCONFIG
  • STANDARD ENGLISH MORPHOLOGY (1.0)
  • TOKENIZE
  • ../common/englishpostags.stringfst
    ../common/english.tok.parse.fst
  • ANALYZE
  • ../common/english.infl.fst
  • ../common/english.morph.guesser.fst
  • MULTIWORD
  • ../common/eng-infl-final.posfilterfst
  • BuildMultiwordsFromLexicon
  • Tag Prefer
  • BuildMultiwordsFromMorphology
  • Tag Prefer

60
Named entities Example input
  • parse <person>Mr. Thejskt Thejs</person>
    arrived.
  • tokenized string
  • Mr. Thejskt Thejs TB NEperson Mr(TB). TB
    Thejskt TB Thejs

. (.) TB (, TB) .
TB arrived
TB
61
Lexicon
  • Lexical entries for tags
  • NEperson NE_SFX @(PROPER name).
  • Lexical entry for token
  • -token TOKEN ( TOKEN)stem
  • NE @(NOUN stem)
  • @(GRAIN proper)
  • @(SOURCE entity-finder)
  • @(OT-MARK NamedEntity).

62
Grammar Rules
  • Rules
  • NOUN-ENTITY --> NE NE_SFX.
  • NOUN -->
  • @NOUN-ENTITY.
  • Config OT Mark
  • (MWE NamedEntity) STOPPOINT

63
Resulting C-structure
64
Resulting F-structure
65
Generation
  • Parsing string to analysis
  • Generation analysis to string
  • What type of input?
  • How to generate

66
Why generate?
  • Machine translation
  • Lang1 string -> Lang1 fstr -> Lang2 fstr -> Lang2
    string
  • Sentence condensation
  • Long string -> fstr -> smaller fstr -> new string
  • Question answering
  • Production of NL reports
  • State of machine or process
  • Explanation of logical deduction
  • Grammar debugging

67
F-structures as input
  • Use f-structures as input to the generator
  • May parse sentences that shouldn't be generated
  • May want to constrain number of generated options
  • Input f-structure may be underspecified

68
XLE generator
  • Use the same grammar for parsing and generation
  • Advantages
  • maintainability
  • write rules and lexicons once
  • But
  • special generation tokenizer
  • different OT ranking

69
Generation tokenizer
  • White space
  • Parsing: multiple white space becomes a single TB
  • John appears. -> John TB appears TB . TB
  • Generation: single TB becomes a single space (or
    nothing)
  • John TB appears TB . TB -> John appears.

  • John appears .

70
Generation tokenizer
  • Capitalization
  • Parsing: optionally decap initially
  • They came -> they came
  • Mary came -> Mary came
  • Generation: always capitalize initially
  • they came -> They came
  • they came
  • May regularize other options
  • quotes, dashes, etc.
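The generation direction of the tokenizer can be sketched as a small function: each TB becomes a single space, or nothing before punctuation, and the first letter is always capitalized. This is an illustration in Python rather than the generation tokenizer FST; the punctuation set is an assumption, and tokens are taken to contain no internal spaces.

```python
PUNCT = {".", ",", ";", ":", "!", "?"}  # assumed set of attaching punctuation

def detokenize(tokens):
    """Turn 'John TB appears TB . TB' into 'John appears.'"""
    words = [t for t in tokens.split() if t != "TB"]
    out = ""
    for word in words:
        if out and word not in PUNCT:
            out += " "       # TB becomes a single space
        out += word          # before punctuation, TB becomes nothing
    return out[:1].upper() + out[1:]  # always capitalize initially

print(detokenize("John TB appears TB . TB"))
print(detokenize("they TB came TB"))
```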

71
Generation morphology
  • Suppress variant forms
  • Parse both favor and favour
  • Generate only one

72
Morphconfig for parsing generation
  • STANDARD ENGLISH MORPHOLOGY (1.0)
  • TOKENIZE
  • P!eng.tok.parse.fst G!eng.tok.gen.fst
  • ANALYZE
  • eng.infl-morph.fst G!amerbritfilter.fst
  • G!amergen.fst
  • ----

73
Reversing the parsing grammar
  • The parsing grammar can be used directly as a
    generator
  • Adapt the grammar with a special OT ranking
    GENOPTIMALITYORDER
  • Why do this?
  • parse ungrammatical input
  • have too many options

74
Ungrammatical input
  • Linguistically ungrammatical
  • They walks.
  • They ate banana.
  • Stylistically ungrammatical
  • No ending punctuation: They appear
  • Superfluous commas: John, and Mary appear.
  • Shallow markup: [NP John and Mary] appear.

75
Too many options
  • All the generated options can be linguistically
    valid, but too many for applications
  • Occurs when more than one string has the same,
    legitimate f-structure
  • PP placement
  • In the morning I left. I left in the morning.

76
Using the Gen OT ranking
  • Generally much simpler than in the parsing
    direction
  • Usually only use standard marks and NOGOOD
  • no marks, no STOPPOINT
  • Can have a few marks that are shared by several
    constructions
  • one or two for dispreferred
  • one or two for preferred

77
Example Comma in coord
  • COORD(_CAT) _CAT @CONJUNCT
  • (COMMA @(OT-MARK GenBadPunct))
  • CONJ
  • _CAT @CONJUNCT.
  • GENOPTIMALITYORDER GenBadPunct NOGOOD.
  • parse: They appear, and disappear.
  • generate without OT: They appear(,) and disappear.
  • with OT: They appear and disappear.

78
Example Prefer initial PP
  • S --> (PP @ADJUNCT @(OT-MARK GenGood))
  • NP @SUBJ
  • VP.
  • VP --> V
  • (NP @OBJ)
  • (PP @ADJUNCT).
  • GENOPTIMALITYORDER NOGOOD +GenGood.
  • parse: they appear in the morning.
  • generate without OT: In the morning they appear.
    They appear in the morning.
  • with OT: In the morning they appear.
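The effect of the two generation rankings can be sketched as a selection over candidate strings, each carrying the OT marks it picked up. This is a deliberately simplified model and not XLE's optimality machinery: it assumes that marks ranked as NOGOOD remove a candidate outright and that, among the survivors, the candidates with the most preference marks win. The function name and data layout are illustrative.

```python
def select(candidates, nogood=(), preferred=()):
    """Pick output strings from (string, marks) candidates.

    Candidates with a NOGOOD mark are dropped; among the rest, those with the
    largest number of preferred marks are kept.
    """
    viable = [(s, marks) for s, marks in candidates
              if not any(m in nogood for m in marks)]
    if not viable:
        return []

    def score(marks):
        return sum(1 for m in marks if m in preferred)

    best = max(score(marks) for _, marks in viable)
    return [s for s, marks in viable if score(marks) == best]

comma = [
    ("They appear, and disappear.", ["GenBadPunct"]),
    ("They appear and disappear.", []),
]
print(select(comma, nogood={"GenBadPunct"}))

pp = [
    ("In the morning they appear.", ["GenGood"]),
    ("They appear in the morning.", []),
]
print(select(pp))                         # no ranking: both strings survive
print(select(pp, preferred={"GenGood"}))  # with ranking: initial PP preferred
```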

79
Generation commands
  • XLE command line
  • regenerate "They appear."
  • generate-from-file my-file.pl
  • (regenerate-from-directory, regenerate-testfile)
  • F-structure window
  • commands generate from this fs
  • Debugging commands
  • regenerate-morphemes

80
Debugging the generator
  • When generating from an f-structure produced by
    the same grammar, XLE should always generate
  • Unless
  • OT marks block the only possible string
  • something is wrong with the tokenizer/morphology
  • regenerate-morphemes: if this gets a string,
    the tokenizer/morphology is not the problem
  • Very hard to debug; newest XLE has robustness
    features to help

81
Underspecified Input
  • F-structures provided by applications are not
    perfect
  • may be missing features
  • may have extra features
  • may simply not match the grammar coverage
  • Missing and extra features are often systematic
  • specify in XLE which features can be added and
    deleted
  • Not matching the grammar is a more serious problem

82
Adding features
  • English to French translation
  • English nouns have no gender
  • French nouns need gender
  • Solution: have XLE add gender
  • the French morphology will control
    the value
  • Specify additions in xlerc
  • set-gen-adds add "GEND"
  • can add multiple features
  • set-gen-adds add "GEND CASE PCASE"
  • XLE will optionally insert the feature

Note: Unconstrained additions make generation undecidable
83
Example
The cat sleeps. -> Le chat dort.
  • PRED 'dormir<SUBJ>'
  • SUBJ PRED 'chat'
  • NUM sg
  • SPEC def
  • TENSE present

PRED 'dormir<SUBJ>' SUBJ PRED 'chat'
NUM sg GEND masc
SPEC def TENSE present
84
Deleting features
  • French to English translation
  • delete the GEND feature
  • Specify deletions in xlerc
  • set-gen-adds remove "GEND"
  • can remove multiple features
  • set-gen-adds remove "GEND CASE PCASE"
  • XLE obligatorily removes the features
  • no GEND feature will remain in the f-structure
  • if a feature takes an f-structure value, that
    f-structure is also removed
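The effect of adding and removing features can be mimicked on f-structures written as nested Python dictionaries. This is a sketch of the idea only, not of how XLE implements set-gen-adds: removal is obligatory and applies at every depth, while addition is optional, so the original structure stays among the candidates. The function names and the path argument are assumptions.

```python
import copy

def remove_features(fs, names):
    """Return a copy of fs without the named features, at every depth.

    A removed feature disappears together with any f-structure it had as value.
    """
    return {k: remove_features(v, names) if isinstance(v, dict) else v
            for k, v in fs.items() if k not in names}

def add_feature_options(fs, path, name, values):
    """Optionally add a feature: the original plus one copy per candidate value."""
    options = [fs]
    for value in values:
        candidate = copy.deepcopy(fs)
        target = candidate
        for step in path:
            target = target[step]   # walk down to the sub-f-structure
        target[name] = value
        options.append(candidate)
    return options

french = {
    "PRED": "dormir<SUBJ>",
    "SUBJ": {"PRED": "chat", "NUM": "sg", "GEND": "masc", "SPEC": "def"},
    "TENSE": "present",
}

no_gender = remove_features(french, {"GEND"})
print(no_gender)

# Going the other way, GEND is offered with each value and the target
# grammar and morphology decide which candidate actually generates.
for option in add_feature_options(no_gender, ["SUBJ"], "GEND", ["masc", "fem"]):
    print(option["SUBJ"])
```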

85
Changing values
  • If values of a feature do not match between the
    input f-structure and the grammar
  • delete the feature and then add it
  • Example case assignment in translation
  • set-gen-adds remove "CASE"
  • set-gen-adds add "CASE"
  • allows dative case in input to become accusative
  • e.g., exceptional case marking verb in input
    language but regular case in output language

86
Creating Paradigms
  • Deleting and adding features within one grammar
    can produce paradigms
  • Specifiers
  • set-gen-adds remove "SPEC"
  • set-gen-adds add "SPEC DET DEMON"
  • regenerate "NP: boys"
  • {the | those | these} boys

87
Generation for Debugging
  • Checking for grammar and lexicon errors
  • create-generator english.lfg
  • reports ill-formed rules, templates, feature
    declarations, lexical entries
  • Checking for ill-formed sentences that can be
    parsed
  • parse a sentence
  • see if all the results are legitimate strings
  • regenerate "they appear."

88
Regeneration example
  • regenerate "In the park they often see the boy
    with the telescope."
  • parsing In the park they often see the boy with
    the telescope.
  • 4 solutions, 0.39 CPU seconds, 178 subtrees
    unified
  • {They see the boy in the park|In the park they
    see the boy} often with the telescope.
  • regeneration took 0.87 CPU seconds.

89
Regenerate testfile
  • regenerate-testfile
  • produces new file testfile.regen
  • sentences with parses and generated strings
  • lists sentences with no strings
  • if there are no Gen OT marks, everything should
    generate back to itself
