Title: Chinese WordSketch Online, corpus-based summaries of word usage
1 Chinese WordSketch Online, corpus-based summaries of word usage
2 Participants
- Adam Kilgarriff, Lexical Computing, UK
- David Tugwell, Tech University Budapest
- Pavel Rychly, Brno University
- Simon Smith, ???? (???)
- ???, ???
- ???, ???? (???)
3 Facing the problem: lexical choice
- "You shall know a word by the company it keeps" (Firth, 1957)
- The meaning of face depends on the collocation (????)
- ?????????????????
- ???????????
- Similarly with save
- Save money
- Save life
- Save a seat for me
4 Look in a dictionary? A corpus?
- Some modern English dictionaries give some collocation (????) information
- Chinese dictionaries give very limited help
- Since the 1980s, corpus KWIC (KeyWord In Context) concordances have been available
5 Pre-computer corpus!
- Oxford English Dictionary
- 20 million index cards
6 KWIC Concordance
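A KWIC concordance lists every occurrence of a keyword with a fixed window of context on either side, aligned on the keyword. The following is only a minimal sketch of that idea, not the tool presented here; the function name `kwic`, the `width` parameter and the toy sentence are assumptions made for illustration.

```python
def kwic(tokens, keyword, width=4):
    """Return one concordance line per occurrence of keyword.

    tokens  -- list of word strings (already tokenised)
    keyword -- word to look up, matched case-insensitively
    width   -- number of context tokens kept on each side (assumed default)
    """
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() != keyword.lower():
            continue
        left = " ".join(tokens[max(0, i - width):i])
        right = " ".join(tokens[i + 1:i + 1 + width])
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>30}  {tok}  {right}")
    return lines


# Toy input built from the "save" examples; not real corpus data.
tokens = "Save money now and save a seat for me".split()
for line in kwic(tokens, "save"):
    print(line)
```

With a real corpus the same loop produces thousands of lines for a frequent word, which is why sorting and filtering matter.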
7 The coloured pens method
- 1 political association
- 2 social event
- 3 group of people
- 4 person in an agreement/dispute
- 5 to be party to something...
8 Limitations of KWIC analysis
- As corpora get bigger: too much data
- 50 lines for a word: read all
- 500 lines: could read all, but it takes a long time
- 5000 lines: no
- Instead, create a statistical summary of word usage
- Show the most salient (?????) collocates (Mutual Information)
9 Mutual Information
- Church and Hanks 1989
- MI: how much more often does a word pair occur than one might expect by chance?
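Mutual information in this sense is usually written as the base-2 logarithm of observed over expected co-occurrence, MI(x, y) = log2( P(x, y) / (P(x) P(y)) ). A small sketch of that calculation from raw counts follows; the function name, argument names and the example counts are hypothetical and are not taken from the corpus.

```python
import math


def mutual_information(f_xy, f_x, f_y, n):
    """MI of a word pair from raw frequencies.

    f_xy -- how often x and y occur together
    f_x  -- frequency of x alone
    f_y  -- frequency of y alone
    n    -- corpus size in tokens
    """
    observed = f_xy / n                # P(x, y)
    expected = (f_x / n) * (f_y / n)   # P(x) * P(y) if x and y were independent
    return math.log2(observed / expected)


# Hypothetical counts: the pair occurs 16 times more often than chance predicts.
print(mutual_information(f_xy=8, f_x=16, f_y=32, n=1024))
```

A score of 0 means the pair occurs exactly as often as chance predicts; each extra point doubles the observed-to-expected ratio.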
10 Collocation listing
For right collocates of save (>5 hits)
word      f(xy)  f(y)    word       f(xy)  f(y)
forests       6   170    life          36  4875
1.2           6   180    dollars        8  1668
lives        37  1697    costs          7  1719
enormous      6   301    thousands      6  1481
annually      7   447    face           9  2590
jobs         20  2001    estimated      6  2387
money        64  6776    your           7  3141
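The listing gives f(xy) and f(y) but neither f(save) nor the corpus size. For a single keyword both of those are constants, so ranking by MI is the same as ranking by the ratio f(xy) / f(y). The sketch below applies that assumption to the figures in the listing; sorting this way appears to reproduce the order in which the words are listed, reading the left-hand column first and then the right-hand one.

```python
# (word, f(xy), f(y)) copied from the collocation listing, in alphabetical order.
rows = [
    ("1.2", 6, 180), ("annually", 7, 447), ("costs", 7, 1719),
    ("dollars", 8, 1668), ("enormous", 6, 301), ("estimated", 6, 2387),
    ("face", 9, 2590), ("forests", 6, 170), ("jobs", 20, 2001),
    ("life", 36, 4875), ("lives", 37, 1697), ("money", 64, 6776),
    ("thousands", 6, 1481), ("your", 7, 3141),
]


def rank_by_ratio(rows):
    """Sort collocates by f(xy) / f(y), highest first.

    With the keyword fixed, this gives the same order as sorting by MI,
    because f(keyword) and the corpus size do not vary between rows.
    """
    return sorted(rows, key=lambda row: row[1] / row[2], reverse=True)


for word, f_xy, f_y in rank_by_ratio(rows):
    print(f"{word:<10} {f_xy:>3} {f_y:>5} {f_xy / f_y:.4f}")
```

The ranking also shows a known bias of MI: rare words such as forests outrank frequent, intuitive collocates such as money.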
11 Limitations of collocation listing
- Some items are not genuine collocates
- your appears only because it is adjacent to save
- The collocates can belong to any part of speech
- It would be better if they were classified by POS
- and by the role they play in the sentence
- Thus,
- for arrest in "The police were quick to arrest a number of suspects on the spot"
- we would like to see
- Keyword: arrest
- Subject: police
- Object: suspect(s)
- Modifier: on the spot
12 Word sketch
- Attempts to meet these requirements
- A corpus-derived one-page summary of a word's grammatical and collocational behaviour
- Implemented for English and Czech
- Chinese and Irish implementations in progress
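One way to picture such a summary is as collocates grouped under the grammatical relation they hold to the keyword. The sketch below is only an illustration of that data structure, not the actual implementation: it assumes the relations have already been extracted as (keyword, relation, collocate) triples, and it ranks by raw count where a real word sketch would use a salience statistic. The first three triples come from the arrest example; the rest are hypothetical.

```python
from collections import Counter, defaultdict


def word_sketch(triples, keyword):
    """Group the collocates of keyword by grammatical relation.

    triples -- iterable of (keyword, relation, collocate) tuples
    Returns {relation: [(collocate, count), ...]}, most frequent first.
    """
    sketch = defaultdict(Counter)
    for head, relation, collocate in triples:
        if head == keyword:
            sketch[relation][collocate] += 1
    return {rel: counts.most_common() for rel, counts in sketch.items()}


triples = [
    ("arrest", "subject", "police"),
    ("arrest", "object", "suspect"),
    ("arrest", "modifier", "on the spot"),
    ("arrest", "object", "suspect"),      # hypothetical extra hit
    ("arrest", "object", "protester"),    # hypothetical extra hit
    ("save", "object", "money"),          # other keywords are ignored
]

for relation, collocates in word_sketch(triples, "arrest").items():
    print(relation, collocates)
```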
13 The corpus: Chinese Gigaword
- A Linguistic Data Consortium corpus
- Very large: over 1 billion characters
- Compiled by David Graff and Ke Chen in 2003
- Minimally tagged
- 286 newswire stories, half from each of
- CNA Taiwan (740 million traditional characters)
- Xinhua PRC (380 million simplified characters)
- Corpus was segmented and tagged using Academia Sinica tools
14 http://corpora.fi.muni.cz/chinese/
- ??
- ?
- ??
- ??
- ?
- http://corpora.fi.muni.cz/chinese/
21 Functions
- KWIC concordance
- Sorting, filtering etc.
- Word sketch
- Automatic thesaurus
- Sketch difference
- discriminates near-synonyms
- In development
- key words in a subcorpus / text type
- how a word varies with text type
23 Grammar writing
- Uses CQL (Corpus Query Language)
- Christ and Schulze, U. Stuttgart, 1994
- defining an object
- v (adj|n|det|num|adv)* n
- rewriting in CQL with BNC/CLAWS-5 tags
- tag"VV." tag"(AJTVDO)." tag"NN."
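To show how such an object pattern picks out verb-object pairs, here is a rough sketch in plain Python rather than CQL. It assumes the pattern means a verb, then any run of adjectives, nouns, determiners, numerals or adverbs, then a noun, with the last noun of the run taken as the object. The lower-case tag names and the tagged sentence are simplifications invented for this example; they are not CLAWS-5 tags.

```python
# Tags that may stand between the verb and its object noun (assumed reading
# of the object pattern; the tag names are simplified for illustration).
MODIFIER_TAGS = {"adj", "n", "det", "num", "adv"}


def verb_object_pairs(tagged):
    """Find (verb, object) pairs in a list of (word, tag) tuples."""
    pairs = []
    for i, (word, tag) in enumerate(tagged):
        if tag != "v":
            continue
        last_noun = None
        j = i + 1
        # Walk over the run of permitted tags, remembering the last noun seen.
        while j < len(tagged) and tagged[j][1] in MODIFIER_TAGS:
            if tagged[j][1] == "n":
                last_noun = tagged[j][0]
            j += 1
        if last_noun is not None:
            pairs.append((word, last_noun))
    return pairs


sentence = [
    ("the", "det"), ("police", "n"), ("arrest", "v"), ("a", "det"),
    ("young", "adj"), ("suspect", "n"), ("on", "prep"), ("the", "det"),
    ("spot", "n"),
]
print(verb_object_pairs(sentence))
```

The preposition ends the run, so spot is not mistaken for the object. A pattern this simple also misses objects that do not directly follow the verb.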
24 Further work
- Improve grammatical relations, especially sentence objects, to account for
- topicalization (??,???,????)
- ? fronting (??????)
- Create a Dr Eye style interface, to show common collocations online, in a text
25 English version available
- For personal use
- www.sketchengine.co.uk
- ??????????!