Introduction to CorporaStanford - PowerPoint PPT Presentation

Slides: 20
Provided by: florian5
Learn more at: https://web.stanford.edu

1
Introduction to Corpora@Stanford
  • Florian Jaeger,
  • tiflo@stanford.edu
  • For the Methods class,
  • December 3rd, 2003

2
Some basic questions
  • Where are our corpora? Where is the software?
  • Is there a list of all the stuff we have?
  • How can I access the software?
  • Where do I start? What information is available
    where?
  • Are there tutorials for the available software?
  • What kind of corpus work is supported at
    Stanford?
  • Corpora are only for those computational folks
    :-)
  • And the most important question:

3
Why bother at all?
  • Because we are often wrong with our (ad hoc)
    intuitions; linguistic methodology is ...
  • well, let's not go there.
  • While corpora have a lot of drawbacks (no
    negative evidence, genre-specific, etc.), they
    offer a lot of opportunities.
  • To illustrate my point, a little case study:

4
Hagit Borer: Some notes on the Syntax and
Semantics of Quantity. Talk for the Sem.
Workshop, 10/31/2002
  • Claim: The interpretation of bare plurals does
    not, actually, consist of any subset of
    (well-defined) singulars.
  • 0.5 apples/apple
  • 1.0 apples/apple
  • 1.5 apples/apple
  • zero apples/apple

5
Hagit Borer: Some notes on the Syntax and
Semantics of Quantity. Talk for the Sem.
Workshop, 10/31/2002
  • Hagit Borer's judgments:
  • 0.5 apples/apple
  • 1.0 apples/apple
  • 1.5 apples/apple
  • zero apples/apple

6
Hagit Borer: Some notes on the Syntax and
Semantics of Quantity. Talk for the Sem.
Workshop, 10/31/2002
  • Google's count:
  • 0.5 apples (120)/apple (179)
  • 1.0 apples (42)/apple (23,600)
  • 1.5 apples (59)/apple (362)
  • zero apples (194)/*apple (124)
  • This also makes clear some of the problems, so
    let's take pears

7
Hagit Borer: Some notes on the Syntax and
Semantics of Quantity. Talk for the Sem.
Workshop, 10/31/2002
  • Google's count:
  • 0.1 pears (32)/pear (118)
  • 0.5 pears (37)/pear (50)
  • 0.7 pears (9)/pear (14)
  • 1.0 pears (14)/pear (24,000)
  • 1 pears (14)/?pear (7,480)
  • One pears (1,130)/?pear (3,060)
  • 1.5 pears (28)/pear (316)
  • zero pears (3)/pear (0)
  • Conclusion:
  • It is amazing how many programs or computer
    products use fruit names.
  • The original judgments seem questionable.
  • BUT can we trust Google?
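
The hit counts on the two Google slides are easier to compare as proportions. The following is a minimal Python sketch that uses only the counts shown above; the helper names `plural_share` and `summarize` are made up for illustration. Raw web hits are noisy (product names, duplicates), so the resulting numbers are indicative at best.

```python
# Google hit counts copied from the slides: quantifier -> (plural hits, singular hits).
APPLE_COUNTS = {
    "0.5": (120, 179),
    "1.0": (42, 23600),
    "1.5": (59, 362),
    "zero": (194, 124),
}
PEAR_COUNTS = {
    "0.1": (32, 118),
    "0.5": (37, 50),
    "0.7": (9, 14),
    "1.0": (14, 24000),
    "1": (14, 7480),
    "One": (1130, 3060),
    "1.5": (28, 316),
    "zero": (3, 0),
}


def plural_share(plural_hits, singular_hits):
    """Fraction of all hits that use the plural noun (hypothetical helper)."""
    total = plural_hits + singular_hits
    if total == 0:
        return None  # no evidence either way
    return plural_hits / total


def summarize(counts):
    """Map each quantifier to its plural share."""
    return {q: plural_share(p, s) for q, (p, s) in counts.items()}


if __name__ == "__main__":
    for noun, counts in (("apple", APPLE_COUNTS), ("pear", PEAR_COUNTS)):
        for quantifier, share in summarize(counts).items():
            print(f"{quantifier} {noun}(s): plural share = {share:.3f}")
```

A share near 1 means the plural dominates and a share near 0 means the singular does. The tiny totals for some rows (for example "zero pears") are a reminder that such proportions say little without more data.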

9
Looking for a corpus
  • There are several sites on the web that can help
    you to find out if what you are looking for
    exists:
  • Databases like David Lee's site (see also our Top
    10 list)
  • The LDC database
  • Our list of corpora (next page)
  • Email lists, see our site under Support
  • Local: corpora@csli.stanford.edu
  • Global: MAJORDOMO@UIB.NO

10
Types of corpora
  • Different languages
  • Different media (speech, video, text)
  • Different levels of annotation
  • No annotation
  • Transcribed speech or video
  • Sociological annotation (gender of speaker,
    average age of audience, dialect of speaker,
    etc.)
  • Discourse and textual information (publication
    date, number of discourse participants,
    discussion panel vs. novel, etc.)
  • Linguistic annotation (phonemes, prosody, syntax,
    morpho-syntax, lexemes, phonological segments,
    syllables, etc.)

11
Looking for a specific corpus
  • List of available corpora
  • If the corpus is on AFS
  • If the corpus is on the Corpus Computer
  • If the corpus is on CD
  • If the corpus is on the WWW
  • If the corpus has special license conditions
  • If we don't have the corpus

13
Tools & software
  • General
  • Where to start
  • Local online tutorials (see also external
    references and manuals)
  • The corpus TA
  • corpora_at_csli.stanford.edu
  • Little helpers

14
A brief look at some tools
  • BNC Web
  • Problem: Superiority ("who the hell")
  • Problem: Is the distribution of "is like" age
    dependent?
  • General information
  • Age (easy export to e.g. Excel)
  • Crosstabs
  • TGrep2 and Tgrep
  • Tutorial
  • Examples
  • tgrep2 -c wsj_mrg.t2c.gz -l 'VP
  • tgrep2 -c wsj_mrg.t2c.gz -l 'VP
  • tgrep2 -c wsj_mrg.t2c.gz -l 'VPfoo gave)
  • tgrep2 -c wsj_mrg.t2c.gz -l 'VPfoo gave)

15
Note: Tgrep is right-headed
  • The following pattern matches an S which has a
    child A and another child that is a C and that
    the A has a child B
  • S < (A < B) < C
  • However, this pattern means that S has child A
    and that A has children B and C
  • S < (A < B < C)
  • It is equivalent to this
  • S
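
The two readings described in words above can be checked on toy trees. This is a minimal Python sketch, not TGrep2 itself: trees are plain `(label, children)` pairs, and the function names `reading_one` and `reading_two` are invented for illustration.

```python
# Toy trees as (label, [children]) pairs; this is NOT how TGrep2 stores trees.
def node(label, *children):
    return (label, list(children))


def children_labeled(tree, label):
    """All direct children of `tree` that carry the given label."""
    return [child for child in tree[1] if child[0] == label]


def has_child(tree, label):
    return bool(children_labeled(tree, label))


def reading_one(s):
    """S has a child A that has a child B, and S itself has a child C."""
    return (
        s[0] == "S"
        and any(has_child(a, "B") for a in children_labeled(s, "A"))
        and has_child(s, "C")
    )


def reading_two(s):
    """S has a child A, and that same A has children B and C."""
    return s[0] == "S" and any(
        has_child(a, "B") and has_child(a, "C")
        for a in children_labeled(s, "A")
    )


# C is a daughter of S: (S (A B) C)
c_under_s = node("S", node("A", node("B")), node("C"))
# C is a daughter of A: (S (A B C))
c_under_a = node("S", node("A", node("B"), node("C")))

if __name__ == "__main__":
    print(reading_one(c_under_s), reading_two(c_under_s))
    print(reading_one(c_under_a), reading_two(c_under_a))
```

The only difference between the two readings is which node the C relation attaches to, which is exactly what the placement of parentheses in a pattern decides.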

16
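Some more TGrep2 syntax
  • A < B: A is the parent of B.
  • A > B: A is the child of B.
  • A <N B: B is the Nth child of A (the first child
    is <1).
  • A >N B: A is the Nth child of B (the first child
    is >1).
  • A <, B: Synonymous with A <1 B.
  • A >, B: Synonymous with A >1 B.
  • A <-N B: B is the Nth-to-last child of A (the
    last child is <-1).
  • A >-N B: A is the Nth-to-last child of B (the
    last child is >-1).
  • A <- B: B is the last child of A (synonymous
    with A <-1 B).
  • A >- B: A is the last child of B (synonymous
    with A >-1 B).
  • A <` B: B is the last child of A (also
    synonymous with A <-1 B).
  • A >` B: A is the last child of B (also
    synonymous with A >-1 B).
  • A <: B: B is the only child of A.
  • A >: B: A is the only child of B.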
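
The child-position relations on this slide (Nth child, Nth-to-last child, only child) can be spelled out on a toy tree. This is a rough Python sketch under stated assumptions: the `Node` class and the function names are hypothetical, and nothing here reflects how TGrep2 is implemented.

```python
class Node:
    """Minimal tree node with a parent pointer (illustrative only)."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self


def is_nth_child(a, b, n):
    """a is the nth child of b.

    n counts from 1; a negative n counts from the end, so -1 is the last child.
    """
    if a.parent is not b or n == 0 or abs(n) > len(b.children):
        return False
    index = n - 1 if n > 0 else n
    return b.children[index] is a


def is_only_child(a, b):
    """a is the one and only child of b."""
    return a.parent is b and len(b.children) == 1


# Toy tree: (S (NP DT JJ NN) (VP VBD))
dt, jj, nn, vbd = Node("DT"), Node("JJ"), Node("NN"), Node("VBD")
np_ = Node("NP", [dt, jj, nn])
vp = Node("VP", [vbd])
s = Node("S", [np_, vp])

if __name__ == "__main__":
    print(is_nth_child(dt, np_, 1))   # DT as first child of NP
    print(is_nth_child(nn, np_, -1))  # NN as last child of NP
    print(is_only_child(vbd, vp))     # VBD as only child of VP
```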

17
Some more TGrep2 syntax
  • A << B: A dominates B (A is an ancestor of B).
  • A >> B: A is dominated by B (A is a descendant
    of B).
  • A <<, B: B is a left-most descendant of A.
  • A >>, B: A is a left-most descendant of B.
  • A <<` B: B is a right-most descendant of A.
  • A >>` B: A is a right-most descendant of B.
  • A <<: B: There is a single path of descent from
    A and B is on it.
  • A >>: B: There is a single path of descent from
    B and A is on it.
  • A . B: A immediately precedes B.
  • A , B: A immediately follows B.
  • A .. B: A precedes B.
  • A ,, B: A follows B.
  • A $ B: A is a sister of B (and A ≠ B).
  • A $. B: A is a sister of and immediately
    precedes B.
  • A $, B: A is a sister of and immediately follows
    B.
  • A $.. B: A is a sister of and precedes B.
  • A $,, B: A is a sister of and follows B.
  • A = B: The node matched by A is also matched by
    B.
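
Dominance, sisterhood, and precedence as described on this slide can be illustrated the same way. The Python sketch below is an assumption-laden toy: `Node` and all function names are hypothetical, and precedence is computed from leaf positions, which is one common way to define it rather than a statement about TGrep2 internals.

```python
class Node:
    """Minimal tree node with a parent pointer (illustrative only)."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self


def is_dominated_by(a, b):
    """a is a proper descendant of b."""
    ancestor = a.parent
    while ancestor is not None:
        if ancestor is b:
            return True
        ancestor = ancestor.parent
    return False


def is_sister_of(a, b):
    """a and b are distinct nodes that share a parent."""
    return a is not b and a.parent is not None and a.parent is b.parent


def leaves(node):
    """Leaves under node, left to right (a leaf counts as its own leaf)."""
    if not node.children:
        return [node]
    found = []
    for child in node.children:
        found.extend(leaves(child))
    return found


def leaf_span(root, node):
    """Positions of node's first and last leaf among root's leaves."""
    order = leaves(root)
    own = leaves(node)
    return order.index(own[0]), order.index(own[-1])


def precedes(root, a, b):
    """Every leaf of a lies to the left of every leaf of b."""
    return leaf_span(root, a)[1] < leaf_span(root, b)[0]


def immediately_precedes(root, a, b):
    """a's last leaf is directly followed by b's first leaf."""
    return leaf_span(root, a)[1] + 1 == leaf_span(root, b)[0]


# Toy tree: (S (NP DT NN) (VP VBD))
dt, nn, vbd = Node("DT"), Node("NN"), Node("VBD")
np_, vp = Node("NP", [dt, nn]), Node("VP", [vbd])
s = Node("S", [np_, vp])
```

In this toy tree NP is a sister of VP and immediately precedes it, while DT precedes VBD without immediately preceding it, because NN sits in between.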

18
The alternative with windows
  • TigerSearch 2.1 screen shots
  • Grammar search
  • Collocation search

19
The end, my friends
  • Want to help?
  • The website can always use additions (short
    blurbs about software, your opinion about the
    user-friendliness of a certain web interface,
    etc.)
  • Tschuessi!