Title: Introduction to Corpora@Stanford
1 Introduction to Corpora@Stanford
- Florian Jaeger,
- tiflo_at_stanford.edu
- For the Methods class,
- December 3rd, 2003
2 Some basic questions
- Where are our corpora? Where is the software?
- Is there a list of all the stuff we have?
- How can I access the software?
- Where do I start? What information is available where?
- Are there tutorials for the available software?
- What kind of corpus work is supported at Stanford?
- "Corpora are only for those computational folks" :-)
- And the most important question:
3 Why bother at all?
- Because our (ad hoc) intuitions are often wrong. Linguistic methodology is, well, let's not go there.
- While corpora have a lot of drawbacks (no negative evidence, genre-specific, etc.), they offer a lot of opportunities.
- To illustrate my point, a little case study:
4 Hagit Borer: Some notes on the Syntax and Semantics of Quantity. Talk for the Sem. Workshop, 10/31/2002
- Claim: The interpretation of bare plurals does not, actually, consist of any subset of (well-defined) singulars.
- 0.5 apples/apple
- 1.0 apples/apple
- 1.5 apples/apple
- zero apples/apple
5 Hagit Borer: Some notes on the Syntax and Semantics of Quantity. Talk for the Sem. Workshop, 10/31/2002
- Hagit Borer's judgments
- 0.5 apples/apple
- 1.0 apples/apple
- 1.5 apples/apple
- zero apples/apple
6 Hagit Borer: Some notes on the Syntax and Semantics of Quantity. Talk for the Sem. Workshop, 10/31/2002
- Google's counts
- 0.5 apples (120)/apple (179)
- 1.0 apples (42)/apple (23,600)
- 1.5 apples (59)/apple (362)
- zero apples (194)/apple (124)
- This also makes some of the problems clear, so let's take pears.
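One way to read the apple counts above is as the share of plural hits per quantity. The following is a minimal Python sketch, assuming the hit counts are copied from the slide as-is; the names `APPLE_HITS`, `plural_share` and `shares` are made up for illustration and are not part of any tool mentioned in this talk.

```python
# Hit counts copied from the slide: quantity -> (hits for "apples", hits for "apple").
APPLE_HITS = {
    "0.5": (120, 179),
    "1.0": (42, 23_600),
    "1.5": (59, 362),
    "zero": (194, 124),
}


def plural_share(plural_hits: int, singular_hits: int) -> float:
    """Fraction of all hits for one quantity that use the plural noun."""
    total = plural_hits + singular_hits
    if total == 0:
        raise ValueError("no hits at all, so the share is undefined")
    return plural_hits / total


def shares(hits: dict) -> dict:
    """Map each quantity to its plural share."""
    return {q: plural_share(p, s) for q, (p, s) in hits.items()}


if __name__ == "__main__":
    for quantity, share in shares(APPLE_HITS).items():
        print(f"{quantity:>4}: {share:.1%} of hits use the plural")
```

On these counts only "zero" leans towards the plural; "1.0" is almost exclusively singular.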
7 Hagit Borer: Some notes on the Syntax and Semantics of Quantity. Talk for the Sem. Workshop, 10/31/2002
- Google's counts
- 0.1 pears (32)/pear (118)
- 0.5 pears (37)/pear (50)
- 0.7 pears (9)/pear (14)
- 1.0 pears (14)/pear (24,000)
- 1 pears (14)/?pear (7,480)
- One pears (1,130)/?pear (3,060)
- 1.5 pears (28)/pear (316)
- zero pears (3)/pear (0)
- Conclusion
- It is amazing how many programs or computer products use fruit names.
- The original judgments seem questionable.
- BUT can we trust Google?
9 Looking for a corpus
- There are several sites on the web that can help you find out whether what you are looking for exists:
- Databases like David Lee's site (see also our Top 10 list)
- The LDC database
- Our list of corpora (next page)
- Email lists, see our site under Support
- Local: corpora_at_csli.stanford.edu
- Global: MAJORDOMO_at_UIB.NO
10 Types of corpora
- Different languages
- Different media (speech, video, text)
- Different levels of annotation
- No annotation
- Transcribed speech or video
- Sociological annotation (gender of speaker, average age of audience, dialect of speaker, etc.)
- Discourse and textual information (publication date, number of discourse participants, discussion panel vs. novel, etc.)
- Linguistic annotation (phonemes, prosody, syntax, morpho-syntax, lexemes, phonological segments, syllables, etc.)
11 Looking for a specific corpus
- List of available corpora
- If the corpus is on AFS
- If the corpus is on the Corpus Computer
- If the corpus is on CD
- If the corpus is on the WWW
- If the corpus has special license conditions
- If we don't have the corpus
13 Tools and software
- General
- Where to start
- Local online tutorials (see also external references and manuals)
- The corpus TA
- corpora_at_csli.stanford.edu
- Little helpers
14 A brief look at some tools
- BNC Web
- Problem: Superiority ("who the hell")
- Problem: Is the distribution of "is like" age-dependent?
- General information
- Age (easy export to e.g. Excel)
- Crosstabs
- TGrep2 and Tgrep
- Tutorial
- Examples
- tgrep2 -c wsj_mrg.t2c.gz -l 'VP
- tgrep2 -c wsj_mrg.t2c.gz -l 'VP
- tgrep2 -c wsj_mrg.t2c.gz -l 'VPfoo gave)
- tgrep2 -c wsj_mrg.t2c.gz -l 'VPfoo gave)
15 Note: Tgrep is right-headed
- The following pattern matches an S which has a child A and another child that is a C, where the A has a child B:
- S < (A < B) < C
- However, this pattern means that S has a child A and that A has children B and C:
- S < (A < B < C)
- It is equivalent to this (the links on A can be listed in either order):
- S < (A < C < B)
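The grouping difference can be made concrete on a toy tree. The sketch below is not TGrep2 itself: it hand-codes the two readings in Python, assuming the patterns in question are `S < (A < B) < C` (both outer links attach to S) and `S < (A < B < C)` (both inner links attach to A). The `Node` class and the function names are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def has_child(node, label, also=lambda child: True):
    """True if `node` has a child labelled `label` that also satisfies `also`."""
    return any(c.label == label and also(c) for c in node.children)


def match_outer(s):
    # Reading of  S < (A < B) < C : S has a child A (which has a child B)
    # and S itself has a child C.
    return (s.label == "S"
            and has_child(s, "A", also=lambda a: has_child(a, "B"))
            and has_child(s, "C"))


def match_inner(s):
    # Reading of  S < (A < B < C) : S has a child A, and that A has
    # both a child B and a child C.
    return (s.label == "S"
            and has_child(s, "A",
                          also=lambda a: has_child(a, "B") and has_child(a, "C")))


# C is a child of S in the first tree and a child of A in the second.
c_under_s = Node("S", [Node("A", [Node("B")]), Node("C")])
c_under_a = Node("S", [Node("A", [Node("B"), Node("C")])])

if __name__ == "__main__":
    print(match_outer(c_under_s), match_outer(c_under_a))
    print(match_inner(c_under_s), match_inner(c_under_a))
```

Each reading accepts exactly one of the two trees, which is why the placement of the parentheses matters.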
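16 Some more TGrep2 syntax
- A < B   A immediately dominates B.
- A > B   A is the child of B.
- A <N B   B is the Nth child of A (the first child is <1).
- A >N B   A is the Nth child of B (the first child is >1).
- A <, B   Synonymous with A <1 B.
- A >, B   Synonymous with A >1 B.
- A <-N B   B is the Nth-to-last child of A (the last child is <-1).
- A >-N B   A is the Nth-to-last child of B (the last child is >-1).
- A <- B   B is the last child of A (synonymous with A <-1 B).
- A >- B   A is the last child of B (synonymous with A >-1 B).
- A <` B   B is the last child of A (also synonymous with A <-1 B).
- A >` B   A is the last child of B (also synonymous with A >-1 B).
- A <: B   B is the only child of A.
- A >: B   A is the only child of B.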
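The child-position relations (Nth child, Nth-to-last child, last child, only child) can be spelled out on a toy tree. This is a rough Python sketch of what those relations mean, not of how TGrep2 implements them; the `Node` class, the function names and the example noun phrase are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def is_nth_child(a, b, n):
    """A is the Nth child of B. Positive n counts from the first child (1),
    negative n counts from the last child (-1)."""
    if n == 0 or abs(n) > len(b.children):
        return False
    index = n - 1 if n > 0 else n  # 1-based from the front, -1-based from the back
    return b.children[index] is a


def is_first_child(a, b):
    return is_nth_child(a, b, 1)


def is_last_child(a, b):
    return is_nth_child(a, b, -1)


def is_only_child(a, b):
    return len(b.children) == 1 and b.children[0] is a


# Toy noun phrase: (NP (DT the) (JJ red) (NN apple)), terminals omitted.
dt, jj, nn = Node("DT"), Node("JJ"), Node("NN")
np = Node("NP", [dt, jj, nn])

if __name__ == "__main__":
    print(is_first_child(dt, np), is_nth_child(jj, np, 2), is_last_child(nn, np))
    print(is_nth_child(jj, np, -2), is_only_child(nn, np))
```

Nodes are compared by identity (`is`) rather than by label, since two different nodes in a tree can carry the same label.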
17 Some more TGrep2 syntax
- A << B   A dominates B (A is an ancestor of B).
- A >> B   A is dominated by B (A is a descendant of B).
- A <<, B   B is a left-most descendant of A.
- A >>, B   A is a left-most descendant of B.
- A <<` B   B is a right-most descendant of A.
- A >>` B   A is a right-most descendant of B.
- A <<: B   There is a single path of descent from A and B is on it.
- A >>: B   There is a single path of descent from B and A is on it.
- A . B   A immediately precedes B.
- A , B   A immediately follows B.
- A .. B   A precedes B.
- A ,, B   A follows B.
- A $ B   A is a sister of B (and A ≠ B).
- A $. B   A is a sister of and immediately precedes B.
- A $, B   A is a sister of and immediately follows B.
- A $.. B   A is a sister of and precedes B.
- A $,, B   A is a sister of and follows B.
- A = B   The node matched by A is also matched by B.
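A few of the dominance and sisterhood relations can be sketched the same way. The Python below covers only "is dominated by", "is a left-most descendant of", "is a sister of" and "is a sister of and immediately precedes"; it is an illustration of the definitions under assumed names (`Node`, `is_dominated_by`, and so on), not TGrep2 code, and the sister functions take the shared parent as an explicit argument to keep the sketch short.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def descendants(b):
    """Yield every node below b: children, grandchildren, and so on."""
    for child in b.children:
        yield child
        yield from descendants(child)


def is_dominated_by(a, b):
    """A is a descendant of B."""
    return any(d is a for d in descendants(b))


def is_leftmost_descendant(a, b):
    """A lies on the path that always takes the first child, starting at B."""
    node = b
    while node.children:
        node = node.children[0]
        if node is a:
            return True
    return False


def is_sister(a, b, parent):
    """A and B are distinct children of the same parent."""
    kids = parent.children
    return a is not b and any(k is a for k in kids) and any(k is b for k in kids)


def sister_immediately_precedes(a, b, parent):
    """A and B are adjacent children of parent, with A directly before B."""
    kids = parent.children
    return any(x is a and y is b for x, y in zip(kids, kids[1:]))


# Toy sentence: (S (NP DT NN) (VP VBD (NP NN))), terminals omitted.
dt, nn = Node("DT"), Node("NN")
subj = Node("NP", [dt, nn])
vbd, obj = Node("VBD"), Node("NP", [Node("NN")])
vp = Node("VP", [vbd, obj])
s = Node("S", [subj, vp])

if __name__ == "__main__":
    print(is_dominated_by(dt, s), is_leftmost_descendant(dt, s))
    print(is_sister(subj, vp, s), sister_immediately_precedes(subj, vp, s))
```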
18 The alternative with Windows
- TigerSearch 2.1 screen shots
- Grammar search
- Collocation search
19 The end, my friends
- Want to help?
- The website can always use additions (short blurbs about software, your opinion about the user-friendliness of a certain web interface, etc.)
- Tschuessi!