Corpus - PowerPoint PPT Presentation

Provided by: barbara86

Transcript and Presenter's Notes

Title: Corpus


1
Corpus
2
Corpus
  • What is a corpus?
  • A collection of naturally occurring language
    text, chosen to characterize a state or variety
    of a language.
  • John Sinclair, Corpus, Concordance, Collocation,
    OUP, 1991
  • Balanced corpus
  • A corpus is a representative sample if what we
    can find in the sample also holds for the general
    population.

3
Some Well-Known Corpora
  • Brown Corpus
  • Created in the 1960s at Brown University
  • 1 Million words
  • Balanced
  • POS tagged

A01 0010 1 The Fulton County Grand Jury said Friday an investigation
A01 0020 1 of Atlanta's recent primary election produced "no evidence"
A01 0020 9 that any irregularities took place.
A01 0030 5 The jury further said in term-end presentments that
A01 0040 3 the City Executive Committee, which had over-all charge
A01 0050 2 of the election, "deserves the praise and thanks of
A01 0050 11 the City of Atlanta" for the manner in which the election
A01 0060 11 was conducted.
4
British National Corpus (BNC)
  • A 100 million word collection of samples of
    written and spoken language from a wide range of
    sources, designed to represent a wide
    cross-section of current British English, both
    spoken and written.
  • About 10 meters of shelf space if printed

5
Some Well-Known Corpora
  • TREC
  • Text REtrieval Conference
  • newspaper articles (majority)
  • abstracts of scientific articles
  • Federal Register
  • 3 GB of compressed text
  • Used to test information retrieval systems

6
Sample from the TREC Corpus
<DOC>
<DOCNO> DOE1-03-0002 </DOCNO>
<TEXT>
Pressure and fluid oscillations at the steam injection into pool water
were discussed from the view point of the conversion of thermal energy
into work. When the change of fluid state moves clockwise in the p-V
diagram, the oscillation sustains since the thermal energy changes into
positive work. The oscillation threshold at the condensation oscillation
was discussed as putting the conversion ratio equal to zero. The change
of oscillation pattern by the steam mass flow at the chugging was also
discussed deriving the p-V diagram by a numerical model of chugging.
</TEXT>
</DOC>



7
Project Gutenberg
  • History
  • Project Gutenberg began in 1971 when Michael Hart
    was given an operator's account with $100,000,000
    of computer time on a mainframe at the University
    of Illinois.
  • Contents
  • Light Literature: Alice in Wonderland, ...
  • Heavy Literature: Bible, Shakespeare, Moby Dick, ...
  • References: Roget's Thesaurus, almanacs, ...

8
Some Well-Known Corpora
  • Penn TreeBank
  • Parse trees for 1 million words of Wall Street
    Journal (WSJ) text.
  • Created by LDC at UPenn
  • The largest treebank
  • Created semi-automatically.

9
( (S
    (NP-SBJ
      (NP Pierre Vinken) ,
      (ADJP (NP 61 years) old) ,)
    (VP will
      (VP join
        (NP the board)
        (PP-CLR as (NP a nonexecutive director))
        (NP-TMP Nov. 29)))
    .))
( (S
    (NP-SBJ Mr. Vinken)
    (VP is
      (NP-PRD
        (NP chairman)
        (PP of
          (NP
            (NP Elsevier N.V.) ,
            (NP the Dutch publishing group)))))
    .))
10
Some Well-Known Corpora
  • SUSANNE corpus
  • Created at the University of Sussex in England
  • 1/7 of the Brown corpus
  • Manually parsed
  • Checked many times
  • Very well documented

11
A010010a  -  YB     <minbrk>        -              Oh.Oh
A010010b  -  AT     The             the            OSNnss.
A010010c  -  NP1s   Fulton          Fulton         Nns.
A010010d  -  NNL1cb County          county         .Nns
A010010e  -  JJ     Grand           grand          .
A010010f  -  NN1c   Jury            jury           .Nnss
A010010g  -  VVDv   said            say            Vd.Vd
A010010h  -  NPD1   Friday          Friday         Nnst.Nnst
A010010i  -  AT1    an              an             FnoNss.
A010010j  -  NN1n   investigation   investigation  .
A010020a  -  IO     of              of             Po.
A010020b  -  NP1t   Atlanta         Atlanta        NsGNns.Nns
A010020c  -  GG     <apos>s         -              .G
A010020d  -  JJ     recent          recent         .
A010020e  -  JJ     primary         primary        .
A010020f  -  NN1n   election        election       .NsPoNss
A010020g  -  VVDv   produced        produce        Vd.Vd
A010020h  -  YIL    <ldquo>         -              .
A010020i  -  ATn    no              no             Nso.
A010020j  -  NN1u   evidence        evidence       .
A010020k  -  YIR    <rdquo>         -              .
A010020m  -  CST    that            that           Fn.
A010030a  -  DDy    any             any            Nps.
A010030b  -  NN2    irregularities  irregularity   .Nps
A010030c  -  VVDv   took            take           Vd.Vd
A010030d  -  NNL1c  place           place          Nso.NsoFnNsoFnoS
A010030e  -  YF     .               -              .O
A010030f  -  YB     <minbrk>        -              Oh.Oh
12
Some Well-Known Corpora
  • SemCor
  • A 200,000-word corpus manually tagged with word
    senses by lexicographers as part of the WordNet
    project.

13
Canadian Hansards
  • A bilingual corpus of the proceedings of the
    Canadian parliament.
  • Contains parallel texts in English and French,
    which have been used to investigate statistically
    based machine translation.

14
<PAIR>
<ENGLISH> no , it is a falsehood . </ENGLISH>
<FRENCH> non , ce est un mensonge . </FRENCH>
</PAIR>
<PAIR>
<ENGLISH> Mr. Speaker , the record speaks for itself with regard to what
I said about the price of fertilizer . </ENGLISH>
<FRENCH> monsieur le Orateur , ma déclaration sur le prix de les engrais
a été confirmée par les événements . </FRENCH>
</PAIR>
15
Word Counting
  • Simplest kind of statistics
  • What is a word?
  • The answer is not as easy as it looks.
  • Space separated? What about punctuation marks?
  • New York bookstores
  • $22.50 McDonald's
  • google.com can't
  • O'Connor I'd
  • Tiburon, Calif.-based data base

16
Words in Chinese/Japanese/Korean
  • No explicit word boundaries
  • Can be treated in the same way as phrasal words
    in English.
  • Almost all words are phrasal words.
  • Segmentation Problem
  • Tokenize the Chinese text so that each token is a
    word.
  • There is no standard definition of what a word is.

17
Words in Web Pages
  • Issues: tags, scripts, images

Source on the next page
18
</table><p class=e><table border=0 cellpadding=1 cellspacing=0 width=100>
<tr><td width=1 valign=top nowrap><font size=-1 class=f>Category&nbsp;&nbsp;&nbsp;</font></td>
<td><font size=-1>
<a href=http://directory.google.com/Top/Regional/Europe/United_Kingdom/Education/Products_and_Services/?tc=1>Regional&nbsp;&gt;&nbsp;Europe&nbsp;&gt;&nbsp;...&nbsp;&gt;&nbsp;Education&nbsp;&gt;&nbsp;Products&nbsp;and&nbsp;Services</a>
&nbsp;&nbsp; </font></table>
<div><p class=g>
<a href=http://www.cobuild.collins.co.uk/ onmousedown="return clk(1,this)"><b>Cobuild</b> Home Page</a><br>
<font size=-1><b>Cobuild</b> English Dictionary, New Edition About
<b>Cobuild</b> About the Bank of English Idiom<br> of the Day Wordwatch
Feature About WordbanksOnline Corpus Access <b>Cobuild</b> <b>...</b> <br>
<span class=f><font size=-1>Description</font></span> Develops and
maintains corpora for modern written and spoken text. Features an online
resource of...<br>
<span class=f>Category </span>
<a class=fl href=http://directory.google.com/Top/Reference/Education/Products_and_Services/English_as_a_Second_Language/?il=1>Reference&nbsp;&gt;&nbsp;Education&nbsp;&gt;&nbsp;...&nbsp;&gt;&nbsp;English&nbsp;as&nbsp;a&nbsp;Second&nbsp;Language</a><br>
<font color=#008000>www.cobuild.collins.co.uk/ - 2k - </font>
<a class=fl href=http://216.239.39.104/search?q=cache:Z7MzsR4S6hYJ:www.cobuild.collins.co.uk/+cobuild&hl=en&ie=UTF-8>Cached</a> -
<a class=fl href=/search?hl=en&lr=&ie=UTF-8&q=related:www.cobuild.collins.co.uk/>Similar pages</a></font>
<blockquote class=g><p class=g>
<a href=http://www.cobuild.collins.co.uk/about.html onmousedown="return clk(2,this)">About <b>COBUILD</b></a><br>
<font size=-1>Welcome to <b>Cobuild</b>. If you&#39;re interested in the
English <b>...</b> A Brief Introduction<br> to <b>Cobuild</b>.
<b>Cobuild</b> is a department of HarperCollins Publishers <b>...</b>
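
To count words in a page like the one above, the markup has to be removed first. The following is a minimal sketch, assuming Python's standard `html.parser` module is acceptable for the job; the class name `TextExtractor` and the helper `visible_text` are hypothetical names introduced here, and real pages may need a more forgiving parser.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, skipping anything inside <script> or <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &nbsp;, &#39;, ...
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def visible_text(html):
    """Return the text of an HTML fragment with tags and scripts removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the pieces with spaces, then collapse all whitespace runs
    # (str.split also treats the non-breaking space as whitespace).
    return " ".join(" ".join(parser.parts).split())


sample = ('<p class="g"><b>Cobuild</b> Home Page<br>'
          '<script>var x = 1;</script>If you&#39;re interested in '
          'Products&nbsp;and&nbsp;Services')
print(visible_text(sample))
```

Joining the pieces with a space keeps words on either side of a tag such as `<br>` apart, at the cost of splitting a word that is only partly wrapped in a tag.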
19
Tokenizer
  • Space tokenizer
  • A token is a consecutive sequence of characters
    between white spaces.
  • Simple
  • Fails in many cases
  • Regular Expression tokenizer
  • Use regular expressions to define tokens.
  • The longest prefix that matches a regular
    expression is a token.
  • Remove the token from the input stream and repeat
    the process.
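
The two tokenizers can be sketched as follows. This is an illustrative sketch, not a reference implementation: the three patterns in `PATTERNS` are assumptions chosen to handle a few of the hard cases from the word-counting slide. Python's `re` alternation returns the first matching alternative rather than the longest, so the sketch tries each pattern separately and keeps the longest match.

```python
import re


def space_tokenize(text):
    """A token is a maximal run of non-whitespace characters."""
    return text.split()


# Hypothetical token definitions; extend as needed.
PATTERNS = [
    re.compile(r"\$?\d+(?:[.,]\d+)*"),            # numbers and prices
    re.compile(r"[A-Za-z]+(?:[.'-][A-Za-z]+)*"),  # words with internal . ' -
    re.compile(r"\S"),                            # any other single character
]


def regex_tokenize(text):
    """Repeatedly remove the longest prefix that matches some pattern."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        # The last pattern matches any non-space character, so at least
        # one candidate is always a real match.
        best = max((p.match(text, pos) for p in PATTERNS),
                   key=lambda m: m.end() if m else -1)
        tokens.append(best.group())
        pos = best.end()
    return tokens


print(space_tokenize("If you pay, the story rolls."))
print(regex_tokenize("If you pay, the story rolls."))
print(regex_tokenize("$22.50 at google.com, can't"))
```

The space tokenizer leaves punctuation glued to words (`pay,` and `rolls.`), while the regular expression tokenizer separates it and keeps `$22.50`, `google.com` and `can't` whole.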

20
Counting Words Example
  • If you pay, the story rolls. If you don't, the
    story folds.
  • 12 word tokens
  • The number of words
  • 8 word types.
  • The number of distinct words
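
The counts for this example can be checked with a short sketch. It assumes that words are lower-cased and that punctuation is discarded; the helper name `word_tokens` is introduced here for illustration only.

```python
import re
from collections import Counter


def word_tokens(text):
    """Lower-case the text and keep alphabetic words, allowing apostrophes."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


text = "If you pay, the story rolls. If you don't, the story folds."
tokens = word_tokens(text)
types = Counter(tokens)

print(len(tokens))  # number of word tokens: 12
print(len(types))   # number of word types: 8
```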

21
Zipf's Law
  • Zipf's Law
  • Rank × frequency ≈ constant
  • For English terms, the constant is about 0.1 when
    frequency is expressed as a fraction of all tokens
  • Example
  • The frequency count of the 50th most frequent
    word is 3 times that of the 150th.
  • Implications
  • 20% of the words cover 80% of the text
  • Difficult to achieve (near) complete coverage.
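
A small sketch of how the law can be checked on a token list: for each of the most frequent words, compute rank × relative frequency and see whether the product stays roughly constant. The function names and the default constant of 0.1 are assumptions taken from the slide, not a fitted model.

```python
from collections import Counter


def zipf_table(tokens, top=10):
    """Return (rank, word, count, rank * count / n) for the top words."""
    counts = Counter(tokens)
    n = len(tokens)
    return [(rank, word, freq, rank * freq / n)
            for rank, (word, freq) in enumerate(counts.most_common(top), start=1)]


def predicted_relative_frequency(rank, constant=0.1):
    """Relative frequency predicted by rank * frequency = constant."""
    return constant / rank


# The 50th word is predicted to be 3 times as frequent as the 150th.
ratio = predicted_relative_frequency(50) / predicted_relative_frequency(150)
print(round(ratio, 6))

toy = ["a"] * 6 + ["b"] * 3 + ["c"] * 2 + ["d"]
for row in zipf_table(toy):
    print(row)
```

On a real corpus the product drifts at the very top and the very bottom of the ranking, so the law is best read as an approximation.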

22
Word     1000rf/n    Word     1000rf/n    Word     1000rf/n
the         59       from        92       or         101
of          58       he          95       about      102
to          82       million     98       market     101
a           98       year       100       they       103
in         103       its        100       this       105
and        122       be         104       would      107
that        75       was        105       you        106
for         84       company    109       which      107
is          72       an         105       bank       109
said        78       has        106       stock      110
it          78       are        109       trade      112
on          77       have       112       his        114
by          81       but        114       more       114
as          80       will       117       who        106
at          80       say        113       one        107
mr          86       new        112       their      108
with        91       share      114
23
(No Transcript)
24
What Do Word Counts Tell Us?
  • Information retrieval (IR) systems use word
    counts to determine the importance of words in a
    document.
  • Two intuitions
  • If a word is frequently used in a document, it is
    probably important in the document.
  • If a word is frequently used in all documents, it
    is not important in any of them.
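
One common way to combine the two intuitions is TF-IDF weighting: term frequency within a document multiplied by the logarithm of the inverse document frequency. The slides do not name a formula, so the exact weighting below is an assumption, and `tf_idf` is a hypothetical helper.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs is a list of token lists; returns one {word: weight} dict per doc."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each word once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            # frequent in this document -> larger; found in every document -> 0
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in tf.items()
        })
    return weights


docs = [["the", "corpus", "corpus"], ["the", "jury"], ["the", "election"]]
weights = tf_idf(docs)
print(weights[0])
```

Here `the` occurs in every document and receives weight 0, while `corpus` is frequent in the first document only and receives the highest weight there.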

25
Keyword Extraction
  • How to find the keywords in a document?
  • Peter Turney's Web demo.

26
Automatic Summarization
  • Many IR systems have a feature called
    summarization.
  • A summary of a document is typically a small
    number of sentences in the document.

27
What is a Sentence?
  • Always treating . ? ! as a sentence boundary is
    correct about 92% of the time.
  • Abbreviations have . at the end
  • But an abbreviation can end a sentence too!
  • A good sentence boundary detector has over 99.8%
    accuracy.
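
The baseline that always splits at `.`, `?` or `!` can be sketched in a few lines, which makes its failure on abbreviations easy to see. The function name `naive_split` is introduced here for illustration.

```python
import re


def naive_split(text):
    """Split after every ., ? or ! that is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]


for piece in naive_split("Mr. Vinken is chairman. He joined Nov. 29."):
    print(repr(piece))
```

The two sentences come back as four pieces, because `Mr.` and `Nov.` are treated as sentence ends.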

28
Sentence Boundary Detection
  • For English, it is generally sufficient to look
    at a token that contains a potential sentence
    boundary ending mark (?.!) and the following
    token.

29
(No Transcript)
30
Declare Sentence Boundary If
  • The first token ends with a double quote (").
  • The first token begins with a lower case letter
    but is not one of
  • p.m. a.m. v. vs. v.s. i.e. cf. viz. e.g. p. pp.
  • The second token is a word that often appears in
    sentence-initial position.

31
  • All of the following are true
  • The first token is a corporate name designator
    such as Inc. or Ltd.
  • The second token does not begin with an open
    brace or parenthesis
  • The second token is either a title word, such as
    president, or Dr. or a word containing no
    periods, but not both
  • The second token is not a country name.

32
  • None of the following is true
  • The first token is an abbreviation
  • The first token is a capitalized word enclosed by
    a pair of parentheses
  • The first token is one of the title words
  • The second token is a single letter initial or a
    number.
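
The token-pair approach can be illustrated with a deliberately simplified sketch. It implements only a subset of the conditions listed on these slides (known abbreviations, title words, and a following token that is a number or a single-letter initial), plus one extra assumption: a following token that starts with a lower-case letter blocks a boundary. The `TITLES` set is a hypothetical word list, and the code makes no claim to reach the accuracy quoted earlier.

```python
import re

# Abbreviation list taken from the slide; TITLES is a hypothetical list.
ABBREVIATIONS = {"p.m.", "a.m.", "v.", "vs.", "v.s.", "i.e.", "cf.",
                 "viz.", "e.g.", "p.", "pp."}
TITLES = {"Mr.", "Mrs.", "Ms.", "Dr.", "Prof."}


def is_boundary(first, second):
    """Decide whether a sentence ends between two adjacent tokens."""
    core = first.rstrip("\"')")  # ignore closing quotes and parentheses
    if not core or core[-1] not in ".?!":
        return False
    if core[-1] in "?!":
        return True
    if core in TITLES or core.lower() in ABBREVIATIONS:
        return False
    if second:
        if second[0].isdigit() or re.fullmatch(r"[A-Z]\.", second):
            return False  # number or single-letter initial follows
        if second[0].islower():
            return False  # assumption: sentences start with a capital
    return True


def split_sentences(text):
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        following = tokens[i + 1] if i + 1 < len(tokens) else ""
        if is_boundary(token, following):
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


text = ("Mr. Vinken is chairman of Elsevier N.V. in Nov. 29 meetings. "
        "He joined at 5 p.m. today.")
for sentence in split_sentences(text):
    print(sentence)
```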

33
Concordance
  • KWIC: Key Word In Context
  • display word occurrences and their contexts
  • align all occurrences of a word to make the left
    and right context more visible.
  • Concordance is an important tool for building
    lexicons (dictionaries).
  • Example
  • Cobuild web site.
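
A KWIC display can be sketched as follows, assuming the text is already tokenized. The function name `kwic` and the fixed character width are illustrative choices and not how the Cobuild site is implemented.

```python
def kwic(tokens, keyword, width=30):
    """Return one line per occurrence, with the keyword in a fixed column."""
    lines = []
    for i, token in enumerate(tokens):
        # Compare case-insensitively, ignoring attached punctuation.
        if token.lower().strip(".,;:!?\"'") == keyword.lower():
            left = " ".join(tokens[:i])[-width:]
            right = " ".join(tokens[i + 1:])[:width]
            lines.append(f"{left:>{width}} {token} {right}")
    return lines


tokens = ("the top of the sheer cliff was tinged . "
          "it was sheer coincidence").split()
lines = kwic(tokens, "sheer", width=15)
for line in lines:
    print(line)
```

Sorting the lines by their right context groups similar uses together, which is the ordering a lexicographer usually wants.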

34
Concordance Example
     I suspected that, aside from the sheer amount of time Deirdre spent alone
  incapacitating ones, because of the sheer amount of role change required. <p>
 The end was not a release - it was a sheer, blissful deliverance. I tumbled
     modifications of slavery itself. Sheer brute force was sufficient to get in
    total darkness but the top of the sheer cliff on the west side was tinged
cultures of the two continents? Is it sheer coincidence that the poorer parts
 BONE coat. No other can match it for sheer comfort & downright toughness.
 reality, not in dreams. For Day, the sheer concreteness of Thérèse's teachings
  months ago. Sometimes, he says, the sheer contrast in living standards makes
        For all we know, God may take sheer delight in being probed
  <h> Striped Semi-Sheer </h> <p> Not sheer enough to see through…but
    age because it--it was almost all sheer entertainment. And a lovely--
       from leaping overboard through sheer exuberance -- and probably
  where the baby bunting in … <p> The sheer familiarity of that ancient nursery
  trackless scrub and finally stop in sheer grass and sage surrounded by bush.
filthy rainwater in the gutter. Or on sheer hope and courage, on days when even