Chinese WordSketch Online, corpus-based summaries of word usage

Slides: 26
Provided by: SueAt3

Transcript and Presenter's Notes

1
Chinese WordSketch Online, corpus-based
summaries of word usage

2
Participants
  • Adam Kilgarriff, Lexical Computing, UK
  • David Tugwell, Tech University Budapest
  • Pavel Rychly, Brno University
  • Simon Smith, ???? (???)
  • ???, ???
  • ???, ???? (???)

3
Facing the problem: lexical choice
  • You shall know a word by the company it keeps
    (Firth, 1957)
  • The meaning of face depends on the collocation
    (????)
  • ?????????????????
  • ???????????
  • Similarly with "save":
  • save money
  • save life
  • save a seat for me

4
Look in a dictionary? A corpus?
  • Some modern English dictionaries give some
    collocation (????) information
  • Chinese dictionaries give very limited help
  • Since the 1980s, corpus KWIC (KeyWord In Context)
    concordances have been available
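A KWIC concordance lines up every occurrence of a keyword with a few words of context on either side. The following is a minimal sketch, assuming a whitespace-tokenized text; the function name `kwic` and the `width` parameter are illustrative and do not come from the slides.

```python
def kwic(tokens, keyword, width=4):
    """Return one formatted concordance line per occurrence of keyword."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            # Up to `width` tokens of context on each side of the keyword.
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Right-align the left context so the keywords line up in a column.
            lines.append(f"{left:>30}  [{tok}]  {right}")
    return lines


# Toy sentence invented for illustration.
text = "You can save money if you save a seat for me and save face"
for line in kwic(text.split(), "save"):
    print(line)
```

Each printed line shows one hit, so a reader can scan the column of keywords and compare the contexts.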

5
  • Pre-computer corpus!
  • Oxford English Dictionary
  • 20 million index cards

6
KWIC Concordance

7
The coloured pens method
  1 political association
  2 social event
  3 group of people
  4 person in an agreement/dispute
  5 to be party to something...

8
Limitations of KWIC analysis
  • As corpora get bigger: too much data
  • 50 lines for a word: read them all
  • 500 lines: could read them all, but it takes a long time
  • 5000 lines: no
  • Instead, create a statistical summary of word
    usage
  • Show the most salient collocates (ranked by Mutual
    Information)
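The raw material for such a summary is a pair of counts: how often each word occurs in the corpus, and how often it occurs next to the keyword. The sketch below assumes that "right collocate" means the word immediately following the keyword, which may be narrower than the window actually used; `right_collocates` and `min_hits` are hypothetical names.

```python
from collections import Counter


def right_collocates(tokens, node, min_hits=1):
    """Map each word that directly follows `node` to (f_xy, f_y).

    f_xy is the number of times the word follows `node`;
    f_y is the word's total frequency in the token list.
    """
    f_y = Counter(tokens)
    f_xy = Counter(nxt for cur, nxt in zip(tokens, tokens[1:]) if cur == node)
    return {w: (n, f_y[w]) for w, n in f_xy.items() if n >= min_hits}


# Toy corpus invented for illustration.
tokens = "save money save money save lives spend money".split()
print(right_collocates(tokens, "save"))
```

Raising `min_hits` drops rare pairs, in the same spirit as a hit threshold on a collocation listing.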

9
Mutual Information
  • Church and Hanks (1989)
  • MI: how much more often a word pair occurs
    than one would expect by chance
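In the usual formulation, MI(x, y) = log2( P(x, y) / (P(x) P(y)) ). Estimating the probabilities from corpus counts gives log2( f(xy) * N / (f(x) * f(y)) ), where N is the corpus size. A minimal sketch, with the function name and the example counts invented for illustration:

```python
import math


def mutual_information(f_xy, f_x, f_y, n):
    """Pointwise mutual information (in bits) from raw counts.

    f_xy: joint frequency of the pair, f_x and f_y: frequencies of each
    word on its own, n: total number of tokens in the corpus.
    """
    return math.log2((f_xy * n) / (f_x * f_y))


# Invented counts: the pair occurs 16 times more often than chance predicts.
print(mutual_information(f_xy=8, f_x=16, f_y=32, n=1024))  # 4.0
```

A score of 0 means the pair occurs exactly as often as chance predicts; positive scores indicate attraction between the two words.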

10
Collocation listing
For right collocates of save (>5 hits)
word       f(xy)  f(y)    word       f(xy)  f(y)
forests        6   170    life          36  4875
1.2            6   180    dollars        8  1668
lives         37  1697    costs          7  1719
enormous       6   301    thousands      6  1481
annually       7   447    face           9  2590
jobs          20  2001    estimated      6  2387
money         64  6776    your           7  3141

11
Limitations of collocation listing
  • Some items are not genuine collocates
  • "your" appears only because it is adjacent to "save"
  • The collocates can belong to any part of speech
  • It would be better if they were classified by POS
  • and by the role they play in the sentence
  • Thus, for "arrest" in "The police were quick to arrest a
    number of suspects on the spot"
  • we would like to see
  • Keyword: arrest
  • Subject: police
  • Object: suspect(s)
  • Modifier: on the spot
12
Wordsketch
  • Attempts to meet these requirements
  • A corpus-derived one-page summary of a word's
    grammatical and collocational behaviour
  • Implemented for English and Czech
  • Chinese and Irish implementations in progress

13
The corpus: Chinese Gigaword
  • A Linguistic Data Consortium corpus
  • Very large: over 1 billion characters
  • Compiled by David Graff and Ke Chen in 2003
  • Minimally tagged
  • 286 newswire stories, half from each of:
  • CNA, Taiwan (740 million traditional characters)
  • Xinhua, PRC (380 million simplified characters)
  • Corpus was segmented and tagged using Academia
    Sinica tools

14
http://corpora.fi.muni.cz/chinese/
  • ??
  • ?
  • ??
  • ??
  • ?
  • http://corpora.fi.muni.cz/chinese/

21
Functions
  • KWIC concordance
  • Sorting, filtering, etc.
  • Word sketch
  • Automatic thesaurus
  • Sketch difference
  • to discriminate near-synonyms
  • In development:
  • key words in a subcorpus / text type
  • how a word's behaviour varies with text type

23
Grammar writing
  • Uses CQL (Corpus Query Language)
  • Christ and Schulze, U. Stuttgart, 1994
  • Example: defining an object
  • v (adj|n|det|num|adv) n
  • Rewriting in CQL with BNC/CLAWS-5 tags:
  • [tag="VV."] [tag="(AJTVDO)."] [tag="NN."]
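The idea behind such a rule can be sketched outside CQL as a scan over part-of-speech tags: find a lexical verb, skip any modifiers, and take the next noun as the object. This is a rough illustration, not the grammar used in the system. The modifier tag list and the choice to allow any number of modifiers are assumptions, and `find_objects` is a hypothetical name.

```python
import re

# CLAWS-5 style tag patterns. The modifier set is an assumed example,
# not the exact list from the slide.
VERB = re.compile(r"VV.")
MODIFIER = re.compile(r"AJ0|AT0|DT0|CRD|AV0")
NOUN = re.compile(r"NN.")


def find_objects(tagged):
    """Return (verb, noun) pairs from a list of (word, tag) tuples."""
    pairs = []
    for i, (word, tag) in enumerate(tagged):
        if not VERB.fullmatch(tag):
            continue
        # Skip over any run of modifier tokens after the verb.
        j = i + 1
        while j < len(tagged) and MODIFIER.fullmatch(tagged[j][1]):
            j += 1
        # The first non-modifier token must be a noun to count as an object.
        if j < len(tagged) and NOUN.fullmatch(tagged[j][1]):
            pairs.append((word, tagged[j][0]))
    return pairs


# Hand-tagged toy sentence, invented for illustration.
sentence = [("police", "NN2"), ("arrested", "VVD"), ("a", "AT0"),
            ("young", "AJ0"), ("suspect", "NN1")]
print(find_objects(sentence))
```

A pattern this shallow misses objects that are not adjacent to the verb, which is one reason that grammatical relations need further refinement.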

24
Further work
  • Improve grammatical relations, especially
    sentence objects, to account for
  • topicalization (??,???,????)
  • ? fronting (??????)
  • Create a Dr Eye-style interface to show common
    collocations online, within a text

25
English version available
  • For personal use
  • www.sketchengine.co.uk
  • ??????????!