1
Linguistic annotation
  • 2/14/2006
  • Nianwen Xue

2
Outline
  • Tokenization / segmentation, POS tagging
  • Treebanking
  • Constituent structure and structural ambiguity
  • Basic grammatical relations and how argument
    structure is instantiated
  • Propbanking/nombanking
  • Cross-linguistic syntactic alternations, verb
    senses and argument structure
  • Others: named entity, coreference, discourse
    connectives

3
Tokenization
  • English
  • In the new position he will oversee Mazda 's U.S.
    sales , services , parts and marketing operations
    .
  • We did n't have much of a choice .
  • U.S. trade officials said the Philippines and
    Thailand would be the main beneficiaries of the
    president 's action .
  • Anything 's possible -- how about the New Guinea
    Fund ?

5
Tokenization
  • The federal government suspended sales of the
    U.S. savings bonds because Congress has n't
    lifted the ceiling on government debt .
  • The Treasury said the U.S. will default on Nov. 9
    if Congress does n't act by then .

7
Tokenization
  • Assets of the 400 taxable funds grew by $ 1.5
    billion during the latest week .
  • Exports in October stood at $ 5.29 billion , a mere
    0.7 % increase from a year earlier , while
    imports increased sharply to $ 5.39 billion , up
    20 % from last year .
  • Do you notice any ambiguity in tokenization?
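
The tokenization shown on these slides (clitics such as n't and 's split off, currency and percent signs separated, abbreviations such as U.S. and Nov. kept whole) can be approximated with a few rules. The following is a rough sketch, not the Penn Treebank tokenizer itself: the function name `tokenize` and the two-entry abbreviation list are assumptions made for illustration.

```python
import re

# Hypothetical, deliberately tiny abbreviation list; a real one would be far larger.
ABBREVIATIONS = {"U.S.", "Nov."}

def tokenize(sentence):
    """Split a raw English sentence into tokens in the style shown on the slides."""
    tokens = []
    for chunk in sentence.split():
        # Peel off trailing punctuation unless the chunk is a known abbreviation.
        trailing = []
        while chunk and chunk not in ABBREVIATIONS and chunk[-1] in ",.?!":
            trailing.insert(0, chunk[-1])
            chunk = chunk[:-1]
        # Separate a leading currency sign and a trailing percent sign.
        chunk = re.sub(r"^\$", "$ ", chunk)
        chunk = re.sub(r"%$", " %", chunk)
        # Split clitics: "didn't" becomes "did n't", "Mazda's" becomes "Mazda 's".
        chunk = re.sub(r"(?i)n't$", " n't", chunk)
        chunk = re.sub(r"(?i)'(s|re|d|ll|ve|m)$", r" '\1", chunk)
        tokens.extend(chunk.split())
        tokens.extend(trailing)
    return tokens

print(tokenize("We didn't have much of a choice."))
print(tokenize("The U.S. will default on Nov. 9 if Congress doesn't act by then."))
```

The abbreviation check is where a rule-based approach is weakest: a period can end an abbreviation, end a sentence, or do both at once, and a fixed list cannot tell these cases apart.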

9
Exercise
  • How many sentences in the WSJ corpus of the Penn
    Treebank contain 're?
  • How many sentences in the WSJ corpus of the Penn
    Treebank contain 'd?
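
One way to approach this exercise, assuming the corpus is available as a list of already tokenized sentences, is sketched below. The function name `count_sentences_with` and the four-sentence sample are hypothetical; the sample only stands in for the real corpus.

```python
def count_sentences_with(token, sentences):
    """Count tokenized sentences in which `token` occurs at least once."""
    return sum(1 for sentence in sentences if token in sentence)

# Hypothetical mini-corpus, tokenized in the style shown on the earlier slides.
SAMPLE = [
    ["We", "'re", "not", "sure", "."],
    ["They", "said", "they", "'d", "act", "."],
    ["We", "'re", "told", "he", "'d", "left", "."],
    ["Anything", "'s", "possible", "."],
]

print(count_sentences_with("'re", SAMPLE))
print(count_sentences_with("'d", SAMPLE))
```

If NLTK and its `treebank` sample are installed, `nltk.corpus.treebank.sents()` could supply the sentences, although that sample covers only a small part of the WSJ corpus.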

10
Big deal, you say
  • The problem is pushed to the forefront for
    languages like Chinese, where there are no
    delimiting spaces between words
  • 这句话里有几个词？
  • Howmanywordsarethereinthissentence?

11
Big deal, you say
  • The problem is pushed to the forefront for
    languages like Chinese, where there are no
    delimiting spaces between words
  • zhe ju hua li you ji ge ci
  • 这 句 话 里 有 几 个 词 ？
  • this CL sentence inside have how-many CL word ?
  • How many words are there in this sentence?

12
A much harder problem than it first appears
  • Well, what if we just create a list of words (a
    dictionary) and compare the sentence against this
    list?
  • 日文章鱼怎么说 ？
  • Dictionary entries: 日 Sun, 日文 Japanese, 文章
    article, 章鱼 octopus, 鱼 fish, 怎么 how, 说 say

13
A much harder problem than it first appears
  • Well, what if we just create a list of words (a
    dictionary) and compare the sentence against this
    list?
  • 日文 章鱼 怎么 说 ？
  • Japanese octopus how say
  • How do you say octopus in Japanese?
  • 日 文章 鱼 怎么 说 ？
  • Sun article fish how say
  • ???
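
The dictionary-lookup idea can be sketched in a few lines. This is a minimal illustration rather than a method from the slides: the function name `segmentations` is hypothetical, and the characters 日文章鱼怎么说 are an assumption about the sentence behind the glosses (Sun, Japanese, article, octopus, fish, how, say). The function enumerates every split whose pieces are all dictionary entries.

```python
def segmentations(text, lexicon):
    """Return every way to split `text` into words that all appear in `lexicon`."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in lexicon:
            # Keep this word and segment whatever is left.
            for rest in segmentations(text[end:], lexicon):
                results.append([word] + rest)
    return results

# Assumed dictionary, mirroring the entries glossed on the slide.
LEXICON = {"日", "日文", "文章", "章鱼", "鱼", "怎么", "说"}

for words in segmentations("日文章鱼怎么说", LEXICON):
    print(" ".join(words))
```

Under these assumptions the dictionary licenses two segmentations, the sensible one and the "Sun article fish" one, and nothing in the word list itself says which to prefer.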

14
Computer problem vs human problem
  • Well, that may be a problem for the computer
    because the computer is dumb
  • But segmentation is difficult for humans as well
  • What is a word?
  • Different criteria do not coincide

15
What if we let native speakers follow their
intuitions?
  • Inadequate level of inter-annotator agreement
  • Sproat et al., 1996: 70%
  • Xue et al., 2005: 90%
  • Conclusion: we need a linguistic definition of
    wordhood to develop segmentation standards
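
The slide does not say how these agreement figures were computed. As one possible measure, the sketch below scores two annotators' segmentations of the same sentence by the F1 overlap of their word spans. The function names `spans` and `agreement` are hypothetical, and the example sentence is an assumed reconstruction of the octopus example.

```python
def spans(words):
    """Map a segmented sentence to the set of (start, end) character spans of its words."""
    result, start = set(), 0
    for word in words:
        result.add((start, start + len(word)))
        start += len(word)
    return result

def agreement(seg_a, seg_b):
    """F1 overlap between the word spans chosen by two annotators."""
    a, b = spans(seg_a), spans(seg_b)
    if not a and not b:
        return 1.0  # two empty segmentations trivially agree
    return 2 * len(a & b) / (len(a) + len(b))

annotator_1 = ["日文", "章鱼", "怎么", "说"]
annotator_2 = ["日", "文章", "鱼", "怎么", "说"]
print(agreement(annotator_1, annotator_2))
```

Only the last two words receive the same spans from both annotators, so the score here is 4/9; identical segmentations score 1.0.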

16
Packard's (2000) notions of word
  • Orthographic word: words are defined by
    delimiters in written text. This appears to have
    no relevance in Chinese, since there are no such
    written delimiters.
  • Sociological word: following Chao (1968, pp.
    136-138), these are that type of unit,
    intermediate in size between a phoneme and a
    sentence, which the general, non-linguistic
    public is conscious of, talks about, has an
    everyday term for, and is practically concerned
    with in various ways. In English this is the lay
    notion of word, whereas in Chinese this is the
    character (字 zi).

17
Packard's notions of word
  • Lexical word: this corresponds to Di Sciullo and
    Williams's (1987) listeme.
  • Semantic word: roughly speaking, this corresponds
    to a unitary concept.
  • Phonological word: defined according to
    phonological criteria. Is it a domain in which a
    phonological process applies? Is it a prosodic
    unit?

18
Packard's notions of word
  • Morphological word: following Di Sciullo and
    Williams (1987), a morphological word is anything
    that is the output of a word-formation rule.
  • Syntactic word: these are all and only the
    constructions that occupy X0 in the syntax. Well,
    first you need to know what X0 is.
  • Psycholinguistic word: this is the word level of
    linguistic analysis that is salient and highly
    relevant to the operation of the language
    processor.

19
Wordhood tests
  • Phonological
  • Bound morpheme: a bound morpheme forms a word
    with its neighboring morpheme.
  • Syntactic
  • Insertion: if another morpheme can be inserted
    between X and Y, then X-Y is unlikely to be a word.
  • XP-substitution: if a morpheme cannot be replaced
    with an XP of the same type, then it is likely to
    be part of a word.

20
Wordhood tests
  • Semantic
  • If the meaning of X-Y is non-compositional, then
    it is a word.
  • Others
  • Productivity: if a rule that combines morpheme X
    and morpheme Y is not productive, then X-Y is
    likely to be a word.
  • Frequency of co-occurrence: if morphemes X and Y
    co-occur frequently, then they form a word.

21
Exercise
  • Given the wordhood criteria and wordhood tests we
    have discussed, how many words are there in
    can't?

22
Answer
  • Orthographic word: 1
  • Sociological word: ?
  • Lexical word: 2
  • Semantic word: 2
  • Phonological word: 1
  • Morphological word: 2
  • Syntactic word: 2
  • Psycholinguistic word: ?

23
Chinese morphological types
  • Reduplication
  • Affixation
  • Compounding
  • Proper names
  • Abbreviations

24
Verbal reduplication
  • 说说 shuo-shuo (speak-speak) speak a little
  • 看看 kan-kan (look-look) take a look
  • 走走 zou-zou (walk-walk) take a walk
  • ?? mo-mo (rub-rub) rub a little
  • 讨论讨论 taolun-taolun (discuss-discuss)
    discuss a little
  • 请教请教 qingjiao-qingjiao (ask-ask) ask a
    little
  • 研究研究 yanjiu-yanjiu (research-research) look
    into

25
Verbal reduplication
  • 说一说 shuo-yi-shuo (speak-one-speak) speak a little
  • 看一看 kan-yi-kan (look-one-look) take a look
  • 走一走 zou-yi-zou (walk-one-walk) take a walk
  • ?一? mo-yi-mo (rub-one-rub) rub a little
  • *讨论一讨论 taolun-yi-taolun (discuss-one-discuss)
  • *请教一请教 qingjiao-yi-qingjiao (ask-one-ask)
  • *研究一研究 yanjiu-yi-yanjiu (research-one-research)

26
Adjectival reduplication
  • 舒服 shufu → 舒舒服服 shushu-fufu comfortable
  • 舒服 shufu → 舒服舒服 shufu-shufu enjoy
  • 干净 ganjing → 干干净净 gangan-jingjing very clean
  • 干净 ganjing → 干净干净 ganjing-ganjing clean up
  • 糊涂 hutu → 糊糊涂涂 huhu-tutu muddle-headed
  • 糊涂 hutu → (?) 糊涂糊涂 hutu-hutu
  • 快活 kuaihuo → 快快活活 kuaikuai-huohuo happy
  • 快活 kuaihuo → 快活快活 kuaihuo-kuaihuo make happy
  • 漂亮 piaoliang → 漂漂亮亮 piaopiao-liangliang pretty
  • 漂亮 piaoliang → 漂亮漂亮 piaoliang-piaoliang make
    pretty
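
The reduplication patterns on the last three slides differ only in how the syllables of the base are copied, which can be checked mechanically. The sketch below works on pinyin syllables; the function name `reduplication_pattern` and the pattern labels (AA, ABAB, AABB, X-yi-X) are labels chosen here for illustration, not terms from the slides.

```python
def reduplication_pattern(base, form):
    """Classify how `form` reduplicates `base`; both are lists of pinyin syllables."""
    if form == base + base:
        # Whole-base copy: shuo-shuo (AA) or taolun-taolun (ABAB).
        return "AA" if len(base) == 1 else "ABAB"
    if form == base + ["yi"] + base:
        # Copy with "yi" (one) in between: kan-yi-kan.
        return "X-yi-X"
    if len(base) == 2 and form == [base[0], base[0], base[1], base[1]]:
        # Syllable-by-syllable copy: shushu-fufu.
        return "AABB"
    return None

print(reduplication_pattern(["shuo"], ["shuo", "shuo"]))
print(reduplication_pattern(["kan"], ["kan", "yi", "kan"]))
print(reduplication_pattern(["shu", "fu"], ["shu", "shu", "fu", "fu"]))
print(reduplication_pattern(["shu", "fu"], ["shu", "fu", "shu", "fu"]))
```

In the adjectival examples the two disyllabic patterns go with different readings: the AABB forms are glossed as adjectives (comfortable, very clean), while the ABAB forms are glossed as verbs (enjoy, clean up).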

27
Prefixation
  • 老 lao-: 老王 lao-wang old Wang
  • 小 xiao-: 小王 xiao-wang small Wang
  • 第 di-: 第一 di-yi first
  • 初 chu-: 初三 chu-san the third
  • 可 ke-: 可爱 ke-ai cute
  • 可 ke-: 可靠 ke-kao reliable

28
Suffixation
  • 学 -xue: 心理学 xinli-xue psychology
  • 家 -jia: 心理学家 xinli-xue-jia psychologist
  • 化 -hua: 绿化 lv-hua greenize (make green)
  • 率 -lv: 录取率 luqu-lv enrollment rate
  • 主义 -zhuyi: 马克思主义 makesi-zhuyi Marxism

29
Compounding
  • Location
  • 客厅 沙发 keting-shafa living room sofa
  • 河 马 hema river horse (hippopotamus)
  • 海 狮 haishi sea lion (seal)
  • Used for
  • 指甲 油 zhijia you nail polish
  • 乒乓 球 pingpang qiu ping-pong ball
  • 太阳 眼镜 taiyang yanjing sunglasses
  • Material
  • 大理石 地板 dalishi diban marble floor
  • 纸 老虎 zhilaohu paper tiger

30
Resultative verb compounding
  • Result
  • 打破 dapo break by hitting
  • 拉开 lakai open by pulling
  • Achievement
  • 写清楚 xieqingchu write clearly
  • 买到 maidao succeed in buying
  • Direction
  • 跳过去 tiaoguoqu jump across
  • 走进来 zoujinlai come walking in

31
Subject-Verb compounds
  • 头疼 tou-teng (head hurt) have a headache
  • 嘴硬 zui-ying (mouth hard) stubborn
  • 眼红 yan-hong (eye red) covet
  • 心酸 xin-suan (heart sour) feel sad
  • 命苦 ming-ku (fate bitter) unlucky

32
Subject-Verb compounds
  • 我 的 头 很 疼
  • I DE head very hurt
  • My head hurts badly.
  • ? ? ? ? ? ??
  • This matter make I very headache
  • This gave me a real headache.

33
Verb-object compounds
  • 出版 chu-ban (emit edition) publish
  • 睡觉 shui-jiao (sleep sleep) sleep
  • 毕业 bi-ye (finish study) graduate
  • 开刀 kai-dao (operate knife) operate
  • 开玩笑 kai-wanxiao (make joke) make a joke
  • 照相 zhao-xiang (shine image) take a picture

34
Verb-object compounds
  • 别 开玩笑 !
  • do-not joke
  • Do not joke!
  • 开 他 的 玩笑
  • make he DE joke
  • Make fun of him.

35
Let's try one
  • ? ? ?? ?? ? ?? ??

Test type      Test                 Test result   Prediction
Phonological   Bound morpheme       Yes?          One word
Phonological   Syllable count       Yes           One word
Syntactic      Insertion            No            One word
Syntactic      XP substitution      No            One word
Semantic       Non-compositional    Yes           One word
Others         Productive           N/A           N/A
Others         Frequency            N/A           N/A

36
But
  • ?? ? ? ?? ? ?

Test type      Test                      Test result   Prediction
Phonological   Bound morphemes?          No            Two words
Phonological   Syllable count            Yes           One word
Syntactic      Insertion                 Yes           Two words
Syntactic      XP substitution           Yes           Two words
Semantic       Non-compositional?        Yes           One word
Others         Productive?               N/A           N/A
Others         Frequent co-occurrence?   N/A           N/A

37
Summary
  • Wordhood has to be decided in context
  • When wordhood tests lead to conflicting
    predictions, decisions will have to be made based
    on what the annotated corpus is for.
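
A small sketch of how the test results could be tabulated to spot such conflicts is given below. The predictions are copied from the second table (the "But" slide), with the N/A rows left out; the function name `summarize` is hypothetical.

```python
from collections import Counter

# Predictions from the second table; tests marked N/A are omitted.
PREDICTIONS = {
    "bound morpheme": "two words",
    "syllable count": "one word",
    "insertion": "two words",
    "XP substitution": "two words",
    "non-compositional": "one word",
}

def summarize(predictions):
    """Return how often each prediction occurs and whether the tests disagree."""
    counts = Counter(predictions.values())
    return counts, len(counts) > 1

counts, conflict = summarize(PREDICTIONS)
print(dict(counts))
print("conflict" if conflict else "all tests agree")
```

The sketch deliberately stops at flagging the conflict. It does not pick the majority prediction, since the summary above says the decision depends on what the annotated corpus is for.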

38
Discussion question
  • Based on the wordhood criteria we have discussed,
    is "make headway" one word or two words?

39
POS tagging: throwing words into different
buckets
  • Each category is a bucket
  • How many buckets are there?
  • Noun
  • Verb
  • Adjective
  • Preposition
  • Adverb
  • Which bucket should "five", "the", "," go into?

40
Penn Treebank tagset (buckets)
  • CC - coordinating conjunction: and, but
  • CD - cardinal number: one, two, three
  • DT - determiner: a, the, this, that
  • EX - existential there
  • FW - foreign word
  • IN - preposition or subordinating conjunction
  • LS - list marker: firstly, secondly
  • TO - to
  • UH - interjection: uh, oh

41
CC or DT
  • Neither/?? he nor/CC she likes skiing.
  • Neither/?? man likes skiing .
  • Either/?? Jean or/CC Mary likes singing.
  • Either/?? girl likes singing.
  • Both/?? Jack and/CC Tom hate singing .
  • Both/?? men hate singing.

42
CC or DT
  • Neither/CC he nor/CC she likes skiing.
  • Neither/DT man likes skiing .
  • Either/CC Jean or/CC Mary likes singing.
  • Either/DT girl likes singing.
  • Both/CC Jack and/CC Tom hate singing .
  • Both/DT men hate singing.

43
CD or NN
  • One/?? of the best reasons
  • The only one/?? of its kind
  • The only ones/?? of its kind

44
CD or NN
  • One/CD of the best reasons
  • The only one/NN of its kind
  • The only ones/NN of its kind

45
EX or RB
  • There/?? was a party in progress.
  • There/?? ensued a melee.
  • There/?? , a party was in progress.
  • There/?? , ensued a melee.

46
EX or RB
  • There/EX was a party in progress.
  • There/EX ensued a melee.
  • There/RB , a party was in progress.
  • There/RB , ensued a melee.

47
The role of context in POS tagging
  • Can we take a list of all the words in a
    language, and decide which bucket each word
    should go into, without looking at the context in
    which the word occurs?
  • water, can, drops
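
The three example words show why a context-free word list is not enough: each can go into more than one bucket. The sketch below makes that concrete. The lexicon is a hand-built assumption, and NN, NNS, VB, VBZ and MD are Penn Treebank tags that do not appear on the tagset slide above.

```python
# Hypothetical mini-lexicon: each word maps to the tags it can take out of context.
POSSIBLE_TAGS = {
    "water": {"NN", "VB"},       # "the water" vs. "to water the plant"
    "can": {"MD", "NN", "VB"},   # "can go" vs. "a can" vs. "to can fruit"
    "drops": {"NNS", "VBZ"},     # "two drops" vs. "it drops"
    "the": {"DT"},
}

def ambiguous_words(words):
    """Return the words that cannot be assigned a single tag without context."""
    return sorted(w for w in words if len(POSSIBLE_TAGS.get(w, ())) > 1)

print(ambiguous_words(["the", "water", "can", "drops"]))
```

Only "the" gets a unique tag from the list alone; the other three need the surrounding words to decide.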

48
Categorizing context
  • Morphological
  • Syntactic
  • Semantic

49
Morphological context
  • Inflectional morphology
  • Verb: destroy, destroying, destroyed
  • Noun: destruction, destructions
  • He watered the plant.
  • Derivational morphology
  • Noun: destruction

50
Syntactic context
  • Verb: The bomb destroyed the building.
  • He decided to water the plant.
  • Noun: the destruction of the building

51
Semantic context
  • Verb: action, activity
  • Noun: state, object, etc.

52
What do we have in Chinese?
  • Morphological clues: not as many
  • Syntactic clues: not as rich, but they exist
  • Semantic clues: about the same

53
Syntactic clues
  • Impoverished, but they exist
  • ? ? ?? ? ??
  • this CL building DE collapse
  • the collapse of this building
  • ? ? ?? ???? ??
  • This CL building seem will collapse
  • It looks like this building will collapse.

54
Semantic clues
  • Same as English
  • Noun: state, object, etc.
  • Verb: action, activity, etc.

55
When syntactic and semantic clues are in conflict
  • ? ? ?? ? ??
  • this CL building DE collapse
  • the collapse of this building
  • Option 1: ?? (collapse) is a verb regardless of its
    context
  • Option 2: ?? (collapse) can be a noun or a verb
    depending on its context
  • The Chinese Treebank decision: option 2
  • POS tags based on syntactic clues encode not only
    the word's own lexical properties, but also
    information provided by its context
  • Context-free POS tags are no better than a
    dictionary
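
Option 2 can be illustrated with a toy context rule applied to the two example sentences from the "Syntactic clues" slide. This is only a sketch: it works on the English glosses because it makes no claim about the original characters, the function name `tag_in_context` is hypothetical, and the single rule (noun after DE, verb otherwise) is an assumption for illustration, not the Chinese Treebank guideline. NN and VV are the noun and verb tags of the Chinese Treebank tagset.

```python
def tag_in_context(tokens, index):
    """Tag tokens[index] as NN when it directly follows DE, otherwise as VV."""
    if index > 0 and tokens[index - 1] == "DE":
        return "NN"
    return "VV"

# Glosses of the two example sentences.
noun_use = ["this", "CL", "building", "DE", "collapse"]
verb_use = ["this", "CL", "building", "seem", "will", "collapse"]

print(tag_in_context(noun_use, noun_use.index("collapse")))
print(tag_in_context(verb_use, verb_use.index("collapse")))
```

The same word receives two different tags, so the tag carries information about the context that a dictionary entry alone could not provide.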

56
Online references
  • Chinese Treebank
  • www.cis.upenn.edu/chinese
  • Sproat, Richard. 2002. Coling tutorial
    www.linguistics.uiuc.edu/rws
  • Penn Treebank
  • www.cis.upenn.edu/treebank/home.html