Linguistics 187/287 Week 7

1
Linguistics 187/287 Week 7
FSTs and XLE Grammars; Generation

Ron Kaplan and Tracy King

2
Regular Relations and Morphological Analysis
3
Morphology: The structure of words

Words have parts, parts code meaning

Inflections: add agreement features
walked = walk + ed
         (move on foot) + Past

Derivations: affixes change core meaning
intractable = in + tractable
              not + possible

Compounds: make new words from old
lighthouse, grasshopper

What are the properties of the coding system?
How can people/computers produce/decode words?

4
Characterizing words

English inflection
Only 4 forms for (most) verbs, 2 for nouns
walk, walks, walked, walking; girl, girls
Make a list (dictionary)
English derivation
Suffixes and prefixes are promiscuous
unredecontaminatability
A looonnng list
Other languages are worse
Spanish, French, Italian: Richer inflection, pronoun attachments
98 forms for some French verbs, 300 forms for some Spanish verbs
Finnish: much richer, 18,000 forms for some verbs!
German, Swedish: Productive compounding gives infinitely many words!
Lebensversicherungsgesellschaftsangestellter

Finite lists are impractical, impossible: cannot characterize infinite sets
5
Word formation

Parts combine selectively

terror + ize = terrorize
lamp + s = lamps
walk + able + ity = walkability
6
Morphological alternations

Sounds/spelling change when parts combine

fake + ing = faking      Drop silent e
stop + ing = stopping    Double consonant

Changes are systematic
baking, driving, dropping, ripping, lying, dying

Appear with newly coined words
zake → zaking, zop → zopping, zie → zying

Changes governed by general rules, not lists of particular cases
7
Rules for morphological alternations

Context-sensitive rewriting

fake + ing = faking
Delete e, but only in context of vowel suffix

A linguistic notation: e → ε / _ + Vowel

General rule formalism: φ → ψ / λ _ ρ
Change a string φ to ψ,
but only when it appears between strings λ and ρ
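
A single rule of this kind can be approximated with a regular-expression substitution. The sketch below is only an illustration, not part of the original slides: it assumes "+" is written explicitly as the morpheme boundary, treats "Vowel" as the letters a, e, i, o, u, and uses the hypothetical function name delete_e.

```python
import re

VOWELS = "aeiou"  # assumption: orthographic vowels stand in for "Vowel"

def delete_e(form: str) -> str:
    """Apply e -> (nothing) / _ + Vowel, with '+' as the morpheme boundary."""
    # The lookahead checks the right-hand context without consuming it,
    # so only the e itself is rewritten.
    return re.sub(rf"e(?=\+[{VOWELS}])", "", form)

print(delete_e("fake+ing"))  # fak+ing
print(delete_e("fake+s"))    # fake+s (the suffix does not start with a vowel)
```

The left context is empty in this rule; a non-empty left context could be expressed the same way with a lookbehind.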

8
Rules apply in sequence
stop + ing
?
ε → p / p _ + Vowel
e → ε / _ + Vowel
i → y / _ + i
+ → ε
9
Rule order matters
ε → p / p _ + Vowel
e → ε / _ + Vowel
i → y / _ + i
+ → ε
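
The effect of rule ordering can be tried out with a small cascade of regular-expression substitutions. This is a rough sketch under stated assumptions, not the slides' own implementation: "+" marks the morpheme boundary, vowels are the letters a, e, i, o, u, and the consonant-doubling rule is kept as simple as on the slide (it would also double the p in jump+ing).

```python
import re

V = "[aeiou]"  # assumption: orthographic vowels

# Each rule is (pattern, replacement); '+' marks the morpheme boundary.
RULES = [
    (rf"(?<=p)(?=\+{V})", "p"),  # insert p after p, before + Vowel
    (rf"e(?=\+{V})", ""),        # delete e before + Vowel
    (r"i(?=\+i)", "y"),          # i -> y before + i
    (r"\+", ""),                 # delete the boundary symbol
]

def apply_rules(form, rules=RULES):
    """Feed the output of each rule to the next one."""
    for pattern, replacement in rules:
        form = re.sub(pattern, replacement, form)
    return form

print(apply_rules("stop+ing"))            # stopping
print(apply_rules("tie+ing"))             # tying
print(apply_rules("tie+ing", RULES[::-1]))  # tieing: boundary deleted too early
```

Running the rules in reverse order deletes the boundary first, so none of the context-sensitive rules can apply afterwards.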
10
Elegant descriptions, but interpretation is:

Complicated
Scan input string for an instance of φ
Look back to see if it is preceded by λ
Look ahead to see if it is followed by ρ
If so, replace φ with ψ
Feed result to next rule

φ → ψ / λ _ ρ

Counterintuitive
Most rules don't change most inputs
But all rules must be attempted: wasted effort
Asymmetric
Easy to produce words from parts
Decoding words into parts is much harder

11
Decoding is harder
ε → p / p _ + Vowel
e → ε / _ + Vowel
i → y / _ + i
+ → ε
12
Mathematical analysis: Rules as relations

Effect of any rule R can be modeled by a relation Rel(R): the (infinite) set of R's input/output pairs
Rel(e → ε / _ + Vowel) =
{ <fake+ing, fak+ing>, <fake+s, fake+s>, <stopp+ed, stopp+ed>, <walk+ed, walk+ed>, ...
  <xyz, xyz>, <qrs, qrs>, ... }

Theorem: For any rewriting rule R, Rel(R) is a regular relation

13
Regular Relations and Finite-State Transducers

Defining properties of regular relations
{<x1, x2>} is a regular relation, for x1, x2 in {ε, a, b, c, ...}
Suppose Rel1 and Rel2 are regular relations. Then
Concatenation:
{<x1x2, y1y2> | <x1,y1> ∈ Rel1, <x2,y2> ∈ Rel2} is regular
Union:
{<x,y> | <x,y> ∈ Rel1 or <x,y> ∈ Rel2} is regular
Arbitrary repetition:
{<xxxx..., yyyy...> | <x,y> ∈ Rel1} is regular

Regular relations are computed by finite-state transducers

Final state
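
The three closure operations can be illustrated on small finite relations, represented as Python sets of (input, output) string pairs. This is only a sketch for intuition: real regular relations are usually infinite and are represented by transducers rather than explicit sets, the function names are made up for this example, and repeat enumerates only a bounded slice of the repetition closure (concatenating any members of the relation, not just copies of one pair).

```python
def concat(rel1, rel2):
    """Pairwise concatenation: <x1x2, y1y2> for <x1,y1> in rel1, <x2,y2> in rel2."""
    return {(x1 + x2, y1 + y2) for x1, y1 in rel1 for x2, y2 in rel2}

def union(rel1, rel2):
    """A pair is in the union if it is in either relation."""
    return rel1 | rel2

def repeat(rel, max_n):
    """Pairs built by concatenating 0..max_n members of rel (a finite slice of the closure)."""
    result = {("", "")}
    layer = {("", "")}
    for _ in range(max_n):
        layer = concat(layer, rel)
        result |= layer
    return result

rel1 = {("a", "b")}
rel2 = {("c", "d")}
print(concat(rel1, rel2))       # {('ac', 'bd')}
print(sorted(repeat(rel1, 2)))  # [('', ''), ('a', 'b'), ('aa', 'bb')]
```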
16
Rule transducers
(convention: write x:y for <x,y>)
a:a b:b c:c ...

FSTs can be created automatically from rules
FSTs may be complex, interpretation is simple
Move from state to state, matching input and producing output
Context requirements enforced by states/transitions
Symmetry of producing, decoding
17
Applying in sequence
Output of one rule/FST is input to next
tie+ing
FST1: ε → p / p _ + Vowel
tie+ing
FST2: e → ε / _ + Vowel
ti+ing
FST3: i → y / _ + i
ty+ing
FST4: + → ε
tying
18
Composing transducers

Theorem
Let R be the relation computed by feeding the
output of transducer FST1 as input to FST2.
Then there is another transducer FST3 that
computes R in a single step.

[Diagram: input passes through FST1 and then FST2 to give output, computing R; FST3 maps input to output in one step]

Corollary: The effect of composing any finite
sequence of FSTs can be modeled by a single one

19
Example
e → ε / _ + Vowel
[Transducer diagram: arcs labeled e:e, a:a, b:b, c:c, ...]
20
Summary

Rewriting rules can give elegant descriptions of
morphological alternations
Rules are difficult to interpret, inefficient for
decoding
Mathematical properties
Every rule denotes a regular relation, has an
equivalent FST
There is a single FST equivalent to every
FST/rule sequence
Finite-state techniques bridge between
Elegant linguistic description
Efficient, intuitive computation

Finite-state techniques bridge between
Theory
Practice
Morphological recognition and generation, spell
checkers, search engines, indexing,
character/handwriting recognition

21
FSTs in XLE grammars

FSTs are used for
tokenization
morphological analysis
Incorporated via the MORPHCONFIG

22
FST Morphologies

Associate a surface form of a word with a
canonical form (lemma, stem) and a set of tags
Tags give grammatical information
Part of speech
Other information (number, tense, etc)
Tags may give additional information
Classes of proper nouns (names, locations)

23
Examples: English

went     go "+Verb" "+PastTense" "+123SP"
boxes    box "+Noun" "+Pl"
         box "+Verb" "+Pres" "+3sg"
Mary     "+Prop" "+Giv" "+Fem" "+Sg"
him      he "+Pron" "+Pers" "+Acc" "+3P" "+Sg"
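
The pairing of surface forms with lemma-plus-tag strings can be mimicked with a small table that is read in both directions. This is a toy sketch, not how an FST morphology is stored: the table holds only the examples from this slide, the tags are written with a leading "+" (an assumption following common finite-state morphology practice), and analyze and generate are hypothetical names.

```python
# (surface form, lemma and tags); one surface form may have several analyses.
ANALYSES = [
    ("went", "go +Verb +PastTense +123SP"),
    ("boxes", "box +Noun +Pl"),
    ("boxes", "box +Verb +Pres +3sg"),
    ("him", "he +Pron +Pers +Acc +3P +Sg"),
]

def analyze(surface):
    """Surface form -> all lemma+tag analyses."""
    return [a for s, a in ANALYSES if s == surface]

def generate(analysis):
    """Lemma+tag analysis -> all surface forms (same table, other direction)."""
    return [s for s, a in ANALYSES if a == analysis]

print(analyze("boxes"))            # ['box +Noun +Pl', 'box +Verb +Pres +3sg']
print(generate("box +Noun +Pl"))   # ['boxes']
```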

24
Examples: French

fleur    fleur "+Fem" "+SG" "+Noun"
venir    venir "+Inf" "+Verb"
vienne   venir "+SubjP" "+SG" "+P1"|"+P3" "+Verb"
tour     tour "+Masc"|"+Fem" "+SG" "+Noun"
France   France "+Fem" "+InvPL" "+Country" "+Proper" "+Noun"

25
Tokenization

Tokenization breaks up a string (sentence) into
tokens (words)
Break off punctuation
Break off clitics
Lowercasing
Allow for markup

26
Punctuation and clitics

Simple breaking
I see them. => I TB see TB them TB . TB
The dog, a poodle, => The TB dog TB , TB a TB poodle TB , TB
Haplology
Find the dog, Muffy. => Find TB the TB dog TB , TB Muffy TB , TB . TB
Go to Palm Dr. => Go TB to TB Palm TB Dr. TB . TB
Clitics
I'll go. => I TB 'll TB go TB . TB
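
The simple-breaking and clitic cases can be approximated with a few regular-expression substitutions. The sketch below is a deterministic toy, not the FST tokenizer itself: it handles only commas, a sentence-final period and the clitic 'll, and it does not attempt haplology or abbreviations such as "Dr."; the function name tokenize is an assumption.

```python
import re

TB = " TB "  # token boundary marker, as on the slides

def tokenize(sentence):
    # Split off the clitic 'll from the preceding word.
    s = re.sub(r"(\w)('ll)\b", r"\1 \2", sentence)
    # Break off commas and a sentence-final period.
    s = re.sub(r",", " , ", s)
    s = re.sub(r"\.$", " .", s)
    # Every token is followed by a token boundary.
    return "".join(tok + TB for tok in s.split()).rstrip()

print(tokenize("I see them."))         # I TB see TB them TB . TB
print(tokenize("The dog, a poodle,"))  # The TB dog TB , TB a TB poodle TB , TB
print(tokenize("I'll go."))            # I TB 'll TB go TB . TB
```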

27
Punctuation Problems

When to break off punctuation is not always clear
Hyphens: part of the word or separate punctuation?
a six-year-old boy
a windshield-wiper blade cleaner
The dog - a poodle - barked.

28
Lowercasing

Need to (optionally) lowercase in certain
positions (depends on the language)
Sentence initially
The boy left. => the boy left.
Mary left. => Mary left.
After colons
The boy left: He was unhappy. => The boy left: he was unhappy.
All caps
Do NOT leave. => do not leave.
IBM did well. => IBM did well.

29
Tokenizers are non-deterministic

Allow for multiple tokenizations to guarantee correct one
Bush saw them. => {Bush | bush} TB saw TB them TB (, TB) . TB
May include markup
All caps lowering marked
IBM => {IBM | ibm}

30
An example

String: Children came.
Tokens: {Children | children} TB came TB (, TB) . TB
Morphology (for the tokens we want):
{child +Noun +Pl | children +Token}
{come +Verb +PastTense +123SP | came +Token}
{. +Punct +Sent | . +Token}
Outputs from tokenizer and morphology fsts can multiply out

31
Allowing for markup

Normal rules of tokenization (lowercasing,
haplology) need to skip markup
The markup should not be broken up like regular
punctuation
labeled bracketing
I see \[NP the dog, a poodle\].
named entities
<person>Mr. Smith</person>

32
The process in XLE
33
Viewing the analysis in XLE

If a FST tokenizer is loaded with the grammar:
tokens I'll try this string.
If a FST morphology is loaded with the grammar:
morphemes testing
These results are also visible in the morph
window (from the c-structure window options)

34
Using FSTs with the grammar

Tokenize the string
Children came. => children TB came TB . TB
Run the tokens through the morphology
child +Noun +Pl    come +Verb +PastTense +123SP    . +Punct +Sent
Parse the lemmas and the tags
sublexical rules build up the words
regular rules build the words into phrases
each tag has a lexical entry

35
Lexical entries for stems and tags

Like the lexical entries you have seen, only with XLE instead of *
boy    N          XLE @(NOUN boy).
+Noun  N_POS_SFX  XLE @(PERS 3).
+Sg    N_SFX      XLE @(NUM sg).
+Pl    N_SFX      XLE @(NUM pl).
Note: no entry for boys
* matches tokens that don't go through FST; XLE matches FST output stems

36
Sublexical rules

Want to insert rules between the lexical
categories (e.g. N) and the same category in the
lexicon
But the lexical category only identifies the stem
or base
Sublexical rules combine the base with the
inflectional tags
So, build a category (N) from the base (N_BASE)

37
Sublexical rules cont.

Like lexical rules, only:
Add _BASE to the category in the lexicon
boy N, +Noun N_POS_SFX, +Sg N_SFX
Example
N --> N_BASE
      N_POS_SFX_BASE
      N_SFX_BASE.
When parsing, the sublexical trees are not shown.
Right click on the leaf node (e.g., N) and choose "show morphemes" to see them.

38
NP example tree
39
Sublexical rules cont.

A --> A_BASE
      A_POS_SFX_BASE
      (A_SFX_BASE).        optionality
N --> { N_BASE             disjunction
        N_POS_SFX_BASE
        N_SFX_BASE
      | VN_BASE
        V_POS_SFX_BASE
        V_SFX_BASE* }.     kleene star

40
Using the -unknown entry

Words with predictable subcat frames can go through the special entry -unknown
The tags will constrain the distribution
This avoids having to list all adverbs, adjectives, nouns, etc.
%stem picks up the lemma/stem
-unknown  ADJ XLE @(ADJ %stem);
          N   XLE @(NOUN %stem);
          ADV XLE @(ADVERB %stem).

41
Lexicon and -unknown

Verbs ought to be listed due to their subcat
frames
Idiosyncratic entries for nouns, etc. need to be
listed
But avoid duplicating in the lexicon the work done by the FST
morphology: mapping to categories is done in only one place

42
FST guessers

The morphologies are good, but don't have all words
FST guessers can be written
work best for languages with lots of morphology
for English
-ed can be a verb or adjective
-ing can be a verb, noun, or adjective
-s can be a plural noun or 3sg verb
words starting with capitals can be proper nouns
etc.

43
Using multiple FSTs

How FSTs are used is declared in the MORPHCONFIG
The toy grammars use a default MORPHCONFIG
TOKENIZE and ANALYZE sections
Sections to specify
where the fsts are
how to treat multiword expressions

44
Example MORPHCONFIG

STANDARD ENGLISH MORPHOLOGY (1.0)
TOKENIZE:
whitespace.fst tokenizer.fst
ANALYZE USEFIRST:
main-morphology.fst
english-guesser.fst
ANALYZE USEALL:
eureka-numbers.fst
eureka-novel-nouns.txt
----

45
Morphconfig cont.

TOKENIZE:
whitespace.fst tokenizer.fst
The fsts listed are composed: output of first is input to second, etc.
Having multiple fsts
may avoid problems with large compositions
allows for modularity

46
Morphconfig cont.

ANALYZE USEFIRST:
main-morphology.fst
english-guesser.fst
Take as input the individual tokens from the
tokenizer
Apply the analyzers one by one until an analysis
is found. Once an analysis is found, it stops.
Effect of the above example
first try to find the analysis in the main
morphology
if that fails, guess the morphological analysis

47
Morphconfig cont.

ANALYZE USEALL:
eureka-numbers.fst
eureka-novel-nouns.fst
Each morphological analyzer is applied to the
string, produces union of results
In the example, if a string could be both a
eureka number and a eureka novel noun, it will
get both analyses
It is not necessary to have both USEALL and
USEFIRST sections.
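
The difference between the two ANALYZE modes can be stated in a few lines of code. This is a sketch of the behavior described on the slides, not XLE's implementation: analyzers are modeled as plain functions from a token to a list of analyses, and the toy lexicon, the toy guesser and its "+Guessed" tag are invented for illustration.

```python
def use_first(token, analyzers):
    """USEFIRST: try analyzers in order and stop at the first one that succeeds."""
    for analyzer in analyzers:
        results = analyzer(token)
        if results:
            return results
    return []

def use_all(token, analyzers):
    """USEALL: apply every analyzer and return the union of the results."""
    results = []
    for analyzer in analyzers:
        for r in analyzer(token):
            if r not in results:
                results.append(r)
    return results

# Hypothetical stand-ins for main-morphology.fst and english-guesser.fst.
MAIN = {"boxes": ["box +Noun +Pl", "box +Verb +Pres +3sg"]}

def main_morphology(token):
    return MAIN.get(token, [])

def guesser(token):
    return [token[:-1] + " +Noun +Pl +Guessed"] if token.endswith("s") else []

print(use_first("boxes", [main_morphology, guesser]))  # main morphology only
print(use_first("zops", [main_morphology, guesser]))   # ['zop +Noun +Pl +Guessed']
```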

48
FST/XLE main points

XLE allows the incorporation of FSTs through the
MORPHCONFIG
Tokenizers, including special markup, and
morphological analyzers can be included
Large morphological analyzers in conjunction with
sublexical rules and the -unknown lexical item
reduce the need for lexicon development

49
Integrating Shallow Mark-up: Part-of-speech tags, Named entities, Syntactic brackets
50
Shallow mark-up of input strings

Part-of-speech tags (tagger?)
I/PRP saw/VBD her/PRP duck/VB.
I/PRP saw/VBD her/PRP$ duck/NN.
Named entities (named-entity recognizer)
<person>General Mills</person> bought it.
<company>General Mills</company> bought it.
Syntactic brackets (chunk parser?)
[NP-S I] saw [NP-O the girl with the telescope].
[NP-S I] saw [NP-O the girl] with the telescope.

51
Hypothesis

Shallow mark-up
Reduces ambiguity
Increases speed
Without decreasing accuracy
(Helps development)

Issues
Markup errors may eliminate correct analyses
Markup process may be slow
Markup may interfere with existing robustness
mechanisms (optimality, fragments, guessers)
Backoff may restore robustness but decrease speed
in 2-pass system (STOPPOINT)

52
Implementation in XLE
How to integrate with minimal changes to existing
system/grammar?
53
XLE String Processing
[Pipeline diagram, bottom to top]
string: The oil filter's gone
  Tokenize (Decap, split, commas)
tokens: {T|t}he TB oil TB filter TB 's TB gone TB
  Analyze (Morph, Guess, Tok)
token morphemes
  Multiwords (Modify sequences)
lexical forms
54
Part of speech tags
[Pipeline diagram, bottom to top: string, Tokenize, tokens, Analyze, token morphemes, Multiwords, lexical forms]

How do tags pass thru Tokenize/Analyze?
Which tags constrain which morphemes?
How?

string: The/DET_ oil/NN_ filter/NN_'s/VBZ_ gone/VBN_
55
Passing tags through Tokenizer

Tokenizer must treat tag characters specially
Must recognize them, e.g. xxx/TAG_
Must not transform them, e.g. x/NN_ → x/nn_
Must not let tags interrupt other patterns,
e.g. wo/MD_n't/RB_ should behave like won't
Must split tags off as separate tokens, for
existing Token path through Analyzer
How to do this with minimal changes to existing
tokenizer FST?
56
Modifying an existing tokenizer

Tags shouldn't be transformed
Tags shouldn't disrupt any other patterns

Script for xfst program:
Tokenizer Tag
.o. Tokenizer/Tag
Don't transform
Don't disrupt
Glitch: Ignore (/) introduces unwanted ambiguity around insertions
Solution, a little less modularity: Construct Tokenizer using cover symbol for tags,
placing them wrt insertion. Substitute actual tag-strings for cover symbol
57
Specifying morpheme/pos-tag constraints

For each pos-tag, grammar/morphology writer specifies by hand the set of compatible morph-tag sequences
Inputs: Description of pos-tag interpretation (from Penn document)
List of all possible morph-tag sequences from analyzer (from program run on Morph/Guesser FSTs)
Output: A text file that characterizes the relationship
E.g. NNS is plural noun, so text file has
(NNS (+Noun +Pl) (+Noun +SP) (+Abbr))
PRP is personal pronoun, so text file has
(PRP (+Pron +Pers +Gen) (+Pron +Poss))
Lisp program reads file, produces POSFilter transducer
Allows NNS_ +Token sequence only if preceded by strings that contain
+Noun and +Pl tags, or +Noun and +SP tags, etc.
POSFilter FST is put in MULTIWORD section,
knocks out undesired morpheme sequences.
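
The filtering idea can be sketched without transducers: a table maps each pos-tag to the morph-tag combinations it is compatible with, and analyses that match none of them are dropped. This is an illustrative approximation only; the real POSFilter is an FST built by a Lisp program, tags are written here with a leading "+", and the decision to reject every analysis for a pos-tag missing from the table is an assumption of this sketch.

```python
# pos-tag -> alternative sets of morph tags that must all be present
POS_TABLE = {
    "NNS": [("+Noun", "+Pl"), ("+Noun", "+SP"), ("+Abbr",)],
}

def compatible(pos_tag, analysis):
    """True if the analysis contains all tags of at least one alternative."""
    tags = set(analysis.split()[1:])  # drop the lemma, keep the tags
    return any(set(required) <= tags for required in POS_TABLE.get(pos_tag, []))

def pos_filter(pos_tag, analyses):
    """Keep only the analyses compatible with the given pos-tag."""
    return [a for a in analyses if compatible(pos_tag, a)]

print(pos_filter("NNS", ["box +Noun +Pl", "box +Verb +Pres +3sg"]))
# ['box +Noun +Pl']
```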

58
All together
[Pipeline diagram, bottom to top]
string
  Tokenize: POSStringFST, Tokenize
tokens
  Analyze
token morphemes
  Multiwords: POSFilterFST
lexical forms
59
MORPHCONFIG

STANDARD ENGLISH MORPHOLOGY (1.0)
TOKENIZE:
../common/englishpostags.stringfst
../common/english.tok.parse.fst
ANALYZE:
../common/english.infl.fst
../common/english.morph.guesser.fst
MULTIWORD:
../common/eng-infl-final.posfilterfst
BuildMultiwordsFromLexicon
Tag Prefer
BuildMultiwordsFromMorphology
Tag Prefer

60
Named entities Example input

parse <person>Mr. Thejskt Thejs</person> arrived.
tokenized string
Mr. Thejskt Thejs TB NEperson Mr(TB). TB
Thejskt TB Thejs

. (.) TB (, TB) .
TB arrived
TB
61
Lexicon

Lexical entries for tags
NEperson NE_SFX @(PROPER name).
Lexical entry for token
-token TOKEN ( TOKEN)stem
NE @(NOUN %stem)
@(GRAIN proper)
@(SOURCE entity-finder)
@(OT-MARK NamedEntity).

62
Grammar Rules

Rules
NOUN-ENTITY --> NE NE_SFX.
NOUN -->
@NOUN-ENTITY.
Config OT Mark
(MWE NamedEntity) STOPPOINT

63
Resulting C-structure
64
Resulting F-structure
65
Generation

Parsing: string to analysis
Generation: analysis to string
What type of input?
How to generate

66
Why generate?

Machine translation
Lang1 string -> Lang1 fstr -> Lang2 fstr -> Lang2 string
Sentence condensation
Long string -> fstr -> smaller fstr -> new string
Question answering
Production of NL reports
State of machine or process
Explanation of logical deduction
Grammar debugging

67
F-structures as input

Use f-structures as input to the generator
May parse sentences that shouldn't be generated
May want to constrain number of generated options
Input f-structure may be underspecified

68
XLE generator

Use the same grammar for parsing and generation
Advantages
maintainability
write rules and lexicons once
But
special generation tokenizer
different OT ranking

69
Generation tokenizer

White space
Parsing: multiple white space becomes a single TB
John   appears. -> John TB appears TB . TB
Generation: single TB becomes a single space (or nothing)
John TB appears TB . TB -> John appears.
                           *John appears .

70
Generation tokenizer

Capitalization
Parsing: optionally decap initially
They came -> they came
Mary came -> Mary came
Generation: always capitalize initially
they came -> They came
             *they came
May regularize other options
quotes, dashes, etc.
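
The generation direction of the tokenizer (token boundaries back to white space, plus initial capitalization) can be sketched as follows. This is a simplified stand-in for a generation tokenizer FST: the punctuation set and the function name detokenize are assumptions, and clitics such as 'll are not reattached.

```python
PUNCT = {".", ",", ";", ":", "!", "?"}  # assumption: marks that attach to the left

def detokenize(token_string):
    """Turn 'John TB appears TB . TB' into 'John appears.'"""
    words = [t for t in token_string.split() if t != "TB"]
    out = ""
    for w in words:
        # A token boundary becomes nothing before punctuation,
        # and a single space otherwise.
        out += w if (not out or w in PUNCT) else " " + w
    # Generation always capitalizes sentence-initially.
    return out[:1].upper() + out[1:]

print(detokenize("John TB appears TB . TB"))  # John appears.
print(detokenize("they TB came TB"))          # They came
```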

71
Generation morphology

Suppress variant forms
Parse both favor and favour
Generate only one

72
Morphconfig for parsing & generation

STANDARD ENGLISH MORPHOLOGY (1.0)
TOKENIZE:
P!eng.tok.parse.fst G!eng.tok.gen.fst
ANALYZE:
eng.infl-morph.fst G!amerbritfilter.fst
G!amergen.fst
----

73
Reversing the parsing grammar

The parsing grammar can be used directly as a
generator
Adapt the grammar with a special OT ranking
GENOPTIMALITYORDER
Why do this?
parse ungrammatical input
have too many options

74
Ungrammatical input

Linguistically ungrammatical
They walks.
They ate banana.
Stylistically ungrammatical
No ending punctuation: They appear
Superfluous commas: John, and Mary appear.
Shallow markup: [NP John and Mary] appear.

75
Too many options

All the generated options can be linguistically
valid, but too many for applications
Occurs when more than one string has the same,
legitimate f-structure
PP placement
In the morning I left. I left in the morning.

76
Using the Gen OT ranking

Generally much simpler than in the parsing
direction
Usually only use standard marks and NOGOOD
no marks, no STOPPOINT
Can have a few marks that are shared by several
constructions
one or two for dispreferred
one or two for preferred

77
Example: Comma in coord

COORD(_CAT) = _CAT: @CONJUNCT;
              (COMMA: @(OTMARK GenBadPunct))
              CONJ
              _CAT: @CONJUNCT.
GENOPTIMALITYORDER GenBadPunct NOGOOD.
parse: They appear, and disappear.
generate: without OT: They appear(,) and disappear.
          with OT:    They appear and disappear.

78
Example: Prefer initial PP

S --> (PP: @ADJUNCT @(OT-MARK GenGood))
      NP: @SUBJ;
      VP.
VP --> V
       (NP: @OBJ)
       (PP: @ADJUNCT).
GENOPTIMALITYORDER NOGOOD +GenGood.
parse: they appear in the morning.
generate: without OT: In the morning they appear.
                      They appear in the morning.
          with OT:    In the morning they appear.
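
The effect of the two example rankings can be imitated with a small filter over generated candidates. This is a heavily simplified model and not XLE's actual optimality machinery: it assumes that marks listed before NOGOOD rule a candidate out, that marks after NOGOOD written with a leading "+" are preference marks, and that the candidate with the most preference marks wins; the function name and data layout are invented.

```python
def filter_by_ot(candidates, order):
    """candidates: list of (string, marks); order: the ranking as a list of names."""
    cut = order.index("NOGOOD")
    nogood = set(order[:cut])
    preferred = {m.lstrip("+") for m in order[cut + 1:] if m.startswith("+")}
    # Candidates carrying a mark ranked before NOGOOD are removed.
    alive = [(s, marks) for s, marks in candidates if not nogood & set(marks)]
    if not alive:
        return []
    # Among the rest, keep those with the most preference marks.
    best = max(len(preferred & set(marks)) for _, marks in alive)
    return [s for s, marks in alive if len(preferred & set(marks)) == best]

comma = [("They appear, and disappear.", ["GenBadPunct"]),
         ("They appear and disappear.", [])]
print(filter_by_ot(comma, ["GenBadPunct", "NOGOOD"]))
# ['They appear and disappear.']

pp = [("In the morning they appear.", ["GenGood"]),
      ("They appear in the morning.", [])]
print(filter_by_ot(pp, ["NOGOOD", "+GenGood"]))
# ['In the morning they appear.']
```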

79
Generation commands

XLE command line
regenerate "They appear."
generate-from-file my-file.pl
(regenerate-from-directory, regenerate-testfile)
F-structure window
commands: generate from this fs
Debugging commands
regenerate-morphemes

80
Debugging the generator

When generating from an f-structure produced by
the same grammar, XLE should always generate
Unless:
OT marks block the only possible string
something is wrong with the tokenizer/morphology
regenerate-morphemes: if this gets a string, the tokenizer/morphology is not the problem
Very hard to debug: newest XLE has robustness features to help

81
Underspecified Input

F-structures provided by applications are not
perfect
may be missing features
may have extra features
may simply not match the grammar coverage
Missing and extra features are often systematic
specify in XLE which features can be added and
deleted
Not matching the grammar is a more serious problem

82
Adding features

English to French translation
English nouns have no gender
French nouns need gender
Solution: have XLE add gender;
the French morphology will control the value
Specify additions in xlerc
set-gen-adds add "GEND"
can add multiple features
set-gen-adds add "GEND CASE PCASE"
XLE will optionally insert the feature

Note: Unconstrained additions make generation undecidable
83
Example
The cat sleeps. -> Le chat dort.

[ PRED 'dormir<SUBJ>'
  SUBJ [ PRED 'chat'
         NUM sg
         SPEC def ]
  TENSE present ]

[ PRED 'dormir<SUBJ>'
  SUBJ [ PRED 'chat'
         NUM sg
         GEND masc
         SPEC def ]
  TENSE present ]
84
Deleting features

French to English translation
delete the GEND feature
Specify deletions in xlerc
set-gen-adds remove "GEND"
can remove multiple features
set-gen-adds remove "GEND CASE PCASE"
XLE obligatorily removes the features
no GEND feature will remain in the f-structure
if a feature takes an f-structure value, that
f-structure is also removed
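
Feature deletion can be pictured on an f-structure encoded as nested dictionaries. The sketch below only illustrates the behavior described on this slide and is not XLE code: the dictionary encoding and the function name remove_features are assumptions.

```python
def remove_features(fs, names):
    """Return a copy of f-structure fs (nested dicts) without the named features."""
    return {
        attr: remove_features(val, names) if isinstance(val, dict) else val
        for attr, val in fs.items()
        if attr not in names  # a removed feature takes its whole value with it
    }

fs = {
    "PRED": "dormir<SUBJ>",
    "SUBJ": {"PRED": "chat", "NUM": "sg", "GEND": "masc", "SPEC": "def"},
    "TENSE": "present",
}
print(remove_features(fs, {"GEND"})["SUBJ"])
# {'PRED': 'chat', 'NUM': 'sg', 'SPEC': 'def'}
```

Removing a feature whose value is itself an f-structure (for example SUBJ) drops the embedded f-structure as well.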

85
Changing values

If values of a feature do not match between the
input f-structure and the grammar
delete the feature and then add it
Example: case assignment in translation
set-gen-adds remove "CASE"
set-gen-adds add "CASE"
allows dative case in input to become accusative
e.g., exceptional case marking verb in input
language but regular case in output language

86
Creating Paradigms

Deleting and adding features within one grammar
can produce paradigms
Specifiers
set-gen-adds remove "SPEC"
set-gen-adds add "SPEC DET DEMON"
regenerate "NP: boys"
{the | those | these} boys

87
Generation for Debugging

Checking for grammar and lexicon errors
create-generator english.lfg
reports ill-formed rules, templates, feature
declarations, lexical entries
Checking for ill-formed sentences that can be
parsed
parse a sentence
see if all the results are legitimate strings
regenerate "they appear."

88
Regeneration example

regenerate "In the park they often see the boy with the telescope."
parsing {In the park they often see the boy with the telescope.}
4 solutions, 0.39 CPU seconds, 178 subtrees unified
{They see the boy in the park|In the park they see the boy} often with the telescope.
regeneration took 0.87 CPU seconds.

89
Regenerate testfile