Title: Corpus
1Corpus
2Corpus
- What is a corpus?
- "A collection of naturally occurring language text, chosen to characterize a state or variety of a language."
- John Sinclair, Corpus, Concordance, Collocation, OUP, 1991
- Balanced corpus
- A corpus is a representative sample if what we
can find in the sample also holds for the general
population.
3Some Well-Known Corpora
- Brown Corpus
- Created in the 1960s at Brown University
- 1 Million words
- Balanced
- POS tagged
A01 0010 1 The Fulton County Grand Jury said Friday an investigation
A01 0020 1 of Atlanta's recent primary election produced "no evidence"
A01 0020 9 that any irregularities took place.
A01 0030 5 The jury further said in term-end presentments that
A01 0040 3 the City Executive Committee, which had over-all charge
A01 0050 2 of the election, "deserves the praise and thanks of
A01 0050 11 the City of Atlanta" for the manner in which the election
A01 0060 11 was conducted.
4British National Corpus (BNC)
- A 100-million-word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British English, both spoken and written.
- About 10 meters of shelf space if printed
5Some Well-Known Corpora
- TREC
- Text REtrieval Conference
- newspaper articles (majority)
- abstracts of scientific articles
- Federal Register
- 3GB of compressed text
- Used to test information retrieval systems
6Sample from the TREC Corpus
<DOC>
<DOCNO> DOE1-03-0002 </DOCNO>
<TEXT>
Pressure and fluid oscillations at the steam injection into pool water were
discussed from the view point of the conversion of thermal energy into work.
When the change of fluid state moves clockwise in the p-V diagram, the
oscillation sustains since the thermal energy changes into positive work. The
oscillation threshold at the condensation oscillation was discussed as putting
the conversion ratio equal to zero. The change of oscillation pattern by the
steam mass flow at the chugging was also discussed deriving the p-V diagram by
a numerical model of chugging.
</TEXT>
</DOC>
7Project Gutenberg
- History
- Project Gutenberg began in 1971 when Michael Hart was given an operator's account with $100,000,000 of computer time on a mainframe at the University of Illinois.
- Contents
- Light Literature: Alice in Wonderland, ...
- Heavy Literature: Bible, Shakespeare, Moby Dick, ...
- References: Roget's Thesaurus, almanacs, ...
8Some Well-Known Corpora
- Penn TreeBank
- Parsed trees of 1 million words of WSJ.
- Created by LDC at UPenn
- The largest treebank
- Created semi-automatically.
9
( (S
    (NP-SBJ
      (NP Pierre Vinken)
      ,
      (ADJP
        (NP 61 years)
        old)
      ,)
    (VP will
      (VP join
        (NP the board)
        (PP-CLR as
          (NP a nonexecutive director))
        (NP-TMP Nov. 29)))
    .))
( (S
    (NP-SBJ Mr. Vinken)
    (VP is
      (NP-PRD
        (NP chairman)
        (PP of
          (NP
            (NP Elsevier N.V.)
            ,
            (NP the Dutch publishing group)))))
    .))
10Some Well-Known Corpora
- SUSANNE corpus
- Created by University of Sussex in England
- 1/7 of Brown corpus
- Manually parsed
- Checked many times
- Very well documented
11
A010010a  -  YB     <minbrk>        -               Oh.Oh
A010010b  -  AT     The             the             OSNnss.
A010010c  -  NP1s   Fulton          Fulton          Nns.
A010010d  -  NNL1cb County          county          .Nns
A010010e  -  JJ     Grand           grand           .
A010010f  -  NN1c   Jury            jury            .Nnss
A010010g  -  VVDv   said            say             Vd.Vd
A010010h  -  NPD1   Friday          Friday          Nnst.Nnst
A010010i  -  AT1    an              an              FnoNss.
A010010j  -  NN1n   investigation   investigation   .
A010020a  -  IO     of              of              Po.
A010020b  -  NP1t   Atlanta         Atlanta         NsGNns.Nns
A010020c  -  GG     <apos>s         -               .G
A010020d  -  JJ     recent          recent          .
A010020e  -  JJ     primary         primary         .
A010020f  -  NN1n   election        election        .NsPoNss
A010020g  -  VVDv   produced        produce         Vd.Vd
A010020h  -  YIL    <ldquo>         -               .
A010020i  -  ATn    no              no              Nso.
A010020j  -  NN1u   evidence        evidence        .
A010020k  -  YIR    <rdquo>         -               .
A010020m  -  CST    that            that            Fn.
A010030a  -  DDy    any             any             Nps.
A010030b  -  NN2    irregularities  irregularity    .Nps
A010030c  -  VVDv   took            take            Vd.Vd
A010030d  -  NNL1c  place           place           Nso.NsoFnNsoFnoS
A010030e  -  YF     .               -               .O
A010030f  -  YB     <minbrk>        -               Oh.Oh
12Some Well-Known Corpora
- SemCor
- A 200,000-word corpus manually tagged with WordNet senses by lexicographers as part of the WordNet Project.
13Canadian Hansards
- A bilingual corpus of the proceedings of the Canadian parliament
- Contains parallel texts in English and French, which have been used to investigate statistically based machine translation.
14
<PAIR>
<ENGLISH> no , it is a falsehood . </ENGLISH>
<FRENCH> non , ce est un mensonge . </FRENCH>
</PAIR>
<PAIR>
<ENGLISH> Mr. Speaker , the record speaks for itself with regard to what I said about the price of fertilizer . </ENGLISH>
<FRENCH> monsieur le Orateur , ma déclaration sur le prix de les engrais a été confirmée par les événements . </FRENCH>
</PAIR>
15Word Counting
- Simplest kind of statistics
- What is a word?
- The answer is not as easy as it looks.
- Space separated? What about punctuation marks?
- New York
- bookstores
- $22.50
- McDonald's
- google.com
- can't
- O'Connor
- I'd
- Tiburon, Calif.-based
- data base
16Words in Chinese/Japanese/Korean
- No word boundary
- Can be treated in the same way as phrasal words in English.
- Almost all words are phrasal words.
- Segmentation Problem
- Tokenize the Chinese text so that each token is a word.
- Lack of a standard definition of what a word is.
17Words in Web Pages
- Issues: tags, scripts, images
Source on the next page
18
</table><p class=e><table border=0 cellpadding=1 cellspacing=0 width=100%><tr><td width=1%
valign=top nowrap><font size=-1 class=f>Category:&nbsp;&nbsp;&nbsp;</font></td><td><font size=-1>
<a href=http://directory.google.com/Top/Regional/Europe/United_Kingdom/Education/Products_and_Services/?tc=1>
Regional&nbsp;&gt;&nbsp;Europe&nbsp;&gt;&nbsp;...&nbsp;&gt;&nbsp;Education&nbsp;&gt;&nbsp;Products&nbsp;and&nbsp;Services</a>
&nbsp;&nbsp; </font></table><div><p class=g><a href=http://www.cobuild.collins.co.uk/
onmousedown="return clk(1,this)"><b>Cobuild</b> Home Page</a><br><font size=-1><b>Cobuild</b>
English Dictionary, New Edition About <b>Cobuild</b> About the Bank of English
Idiom<br> of the Day Wordwatch Feature About WordbanksOnline Corpus Access <b>Cobuild</b>
<b>...</b> <br><span class=f><font size=-1>Description:</font></span> Develops and
maintains corpora for modern written and spoken text. Features an online resource of...<br>
<span class=f>Category: </span><a class=fl
href=http://directory.google.com/Top/Reference/Education/Products_and_Services/English_as_a_Second_Language/?il=1>
Reference&nbsp;&gt;&nbsp;Education&nbsp;&gt;&nbsp;...&nbsp;&gt;&nbsp;English&nbsp;as&nbsp;a&nbsp;Second&nbsp;Language</a><br>
<font color=#008000>www.cobuild.collins.co.uk/ - 2k - </font><a class=fl
href=http://216.239.39.104/search?q=cache:Z7MzsR4S6hYJ:www.cobuild.collins.co.uk/+cobuild&hl=en&ie=UTF-8>Cached</a>
 - <a class=fl href=/search?hl=en&lr=&ie=UTF-8&q=related:www.cobuild.collins.co.uk/>Similar pages</a></font>
<blockquote class=g><p class=g><a href=http://www.cobuild.collins.co.uk/about.html
onmousedown="return clk(2,this)">About <b>COBUILD</b></a><br><font size=-1>Welcome to
<b>Cobuild</b>. If you&#39;re interested in the English <b>...</b> A Brief Introduction<br> to
<b>Cobuild</b>. <b>Cobuild</b> is a department of HarperCollins Publishers <b>...</b>
19Tokenizer
- Space tokenizer
- A token is a consecutive sequence of characters between white spaces.
- Simple
- Fails in many cases
- Regular Expression tokenizer
- Use regular expressions to define tokens.
- The longest prefix that matches a regular expression is a token.
- Remove the token from the input stream and repeat the process.
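The two tokenizers can be sketched in Python as follows. This is a minimal illustration, not the implementation behind the slides: the three patterns in `TOKEN_PATTERNS` and the function names are assumptions chosen for the example, and a real tokenizer would need many more patterns.

```python
import re

def space_tokenize(text):
    """Space tokenizer: a token is a maximal run of non-whitespace characters."""
    return text.split()

# Hypothetical token definitions (assumed for this sketch, not from the slides).
TOKEN_PATTERNS = [
    re.compile(r"\$?\d+(?:\.\d+)?"),          # numbers and prices such as $22.50
    re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*"),  # words, including can't and O'Connor
    re.compile(r"[^\sA-Za-z0-9]"),            # any other single non-space character
]

def regex_tokenize(text):
    """Repeatedly take the longest prefix matched by any pattern as the next token."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        best = ""
        for pattern in TOKEN_PATTERNS:
            m = pattern.match(text, pos)
            if m and len(m.group()) > len(best):
                best = m.group()
        if not best:
            best = text[pos]  # fallback: emit one character so the loop always advances
        tokens.append(best)
        pos += len(best)      # remove the token from the input and repeat
    return tokens

print(space_tokenize("If you pay, the story rolls."))
print(regex_tokenize("If you don't, pay $22.50."))
```

The space tokenizer leaves punctuation attached ("pay," and "rolls."), while the regular expression version separates it and keeps "don't" and "$22.50" whole.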
20Counting Words Example
- "If you pay, the story rolls. If you don't, the story folds."
- 12 word tokens
- The number of words
- 8 word types
- The number of distinct words
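The token and type counts for the example sentence can be checked with a short Python sketch. It assumes that words are lowercased, that punctuation is discarded, and that an internal apostrophe stays inside the word; the function name is hypothetical.

```python
import re

def count_tokens_and_types(text):
    """Return (number of word tokens, number of distinct word types)."""
    # Assumption: case is ignored and a word may contain internal apostrophes.
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    return len(words), len(set(words))

text = "If you pay, the story rolls. If you don't, the story folds."
n_tokens, n_types = count_tokens_and_types(text)
print(n_tokens, n_types)  # 12 8
```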
21Zipf's Law
- Zipf's Law
- Rank × frequency ≈ constant
- For English terms, the constant is about 0.1 when frequency is measured relative to corpus size (r × f/n ≈ 0.1)
- Example
- The frequency count of the 50th most frequent word is 3 times that of the 150th.
- Implications
- 20% of the words cover 80% of the text
- Difficult to achieve (near) complete coverage.
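If rank times frequency is constant, then f(50) × 50 = f(150) × 150, so f(50) = 3 × f(150), which is the example above. The following Python sketch computes r × f/n for the most frequent words of a token list so the claim can be checked on any corpus. The function name and the toy data are assumptions for illustration only.

```python
from collections import Counter

def zipf_table(tokens, top=10):
    """Return (rank, word, frequency, rank * frequency / n) for the most frequent words."""
    counts = Counter(tokens)
    n = len(tokens)
    rows = []
    for rank, (word, freq) in enumerate(counts.most_common(top), start=1):
        rows.append((rank, word, freq, rank * freq / n))
    return rows

# Toy data built so that rank * frequency is the same for the first three words.
tokens = ["a"] * 6 + ["b"] * 3 + ["c"] * 2 + ["d"]
for row in zipf_table(tokens):
    print(row)
```

On real English text the last column should stay roughly flat, near 0.1, for the high-frequency words.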
22
word      1000rf/n    word      1000rf/n    word      1000rf/n
the       59          from      92          or        101
of        58          he        95          about     102
to        82          million   98          market    101
a         98          year      100         they      103
in        103         its       100         this      105
and       122         be        104         would     107
that      75          was       105         you       106
for       84          company   109         which     107
is        72          an        105         bank      109
said      78          has       106         stock     110
it        78          are       109         trade     112
on        77          have      112         his       114
by        81          but       114         more      114
as        80          will      117         who       106
at        80          say       113         one       107
mr        86          new       112         their     108
with      91          share     114
24What Do Word Counts Tell Us?
- Information retrieval (IR) systems use word counts to determine the importance of words in a document.
- Two intuitions
- If a word is frequently used in a document, it is probably important in the document.
- If a word is frequently used in all documents, it is not important in any of them.
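One common way to combine these two intuitions is TF-IDF weighting. The slides do not name a specific formula, so the sketch below is an assumption: it uses term frequency divided by document length, multiplied by the log of the inverse document frequency. The function name and sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: list of token lists. Returns one {word: weight} dict per document."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each word once per document
    weights = []
    for doc in documents:
        tf = Counter(doc)
        weights.append({
            # frequent in this document -> larger; present in every document -> log(1) = 0
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in tf.items()
        })
    return weights

docs = [["the", "corpus", "corpus"], ["the", "parser"], ["the", "tagger"]]
for w in tf_idf(docs):
    print(w)
```

Here "the" appears in every document and gets weight 0, while "corpus" gets the highest weight in the first document.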
25Keyword Extraction
- How to find the keywords in a document?
- Peter Turney's Web demo.
26Automatic Summarization
- Many IR systems have a feature called summarization.
- A summary of a document is typically a small number of sentences taken from the document.
27What is a Sentence?
- Always treating . ? ! as a sentence boundary is correct about 92% of the time.
- Abbreviations have . at the end
- But abbreviations could end a sentence too!
- A good sentence boundary detector has over 99.8% accuracy.
28Sentence Boundary Detection
- For English, it is generally sufficient to look
at a token that contains a potential sentence
boundary ending mark (?.!) and the following
token.
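As a minimal sketch of this idea, the Python function below collects every pair of a token containing a potential sentence-ending mark and the token that follows it. It assumes whitespace tokenization, and the function name is hypothetical. Deciding which pairs are real boundaries is a separate step.

```python
def boundary_candidates(text):
    """Return (token, next_token) pairs where token contains '.', '?' or '!'."""
    tokens = text.split()  # assumption: simple whitespace tokenization
    pairs = []
    for i, tok in enumerate(tokens[:-1]):  # the last token has no following token
        if any(mark in tok for mark in ".?!"):
            pairs.append((tok, tokens[i + 1]))
    return pairs

text = "Mr. Vinken is chairman of Elsevier N.V. He will join the board."
print(boundary_candidates(text))
```

For this text the candidates are ("Mr.", "Vinken") and ("N.V.", "He"). Only the second is a true boundary, which is why always splitting on the mark is not enough.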
30Declare Sentence Boundary If
- The first token ends with a double quote (").
- The first token begins with a lower case letter but is not one of
- p.m. a.m. v. vs. v.s. i.e. cf. viz. e.g. p. pp.
- The second token is a word that often appears at the sentence initial positions.
31
- All of the following are true
- The first token is a corporate name designator such as Inc. or Ltd.
- The second token does not begin with an open brace or parenthesis
- The second token is either a title word, such as president or Dr., or a word containing no periods, but not both
- The second token is not a country name.
32
- None of the following is true
- The first token is an abbreviation
- The first token is a capitalized word enclosed by a pair of parentheses
- The first token is one of the title words
- The second token is a single letter initial or a number.
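The first three conditions listed under "Declare Sentence Boundary If" can be sketched in Python as below. This is a partial illustration under stated assumptions: it treats the three conditions as alternatives (any one is enough), it leaves out the "all of the following" and "none of the following" groups, and the set of frequent sentence-initial words is a hypothetical stand-in because the slides do not give one.

```python
# Abbreviations listed on the slide that do not normally end a sentence.
NON_FINAL_ABBREVIATIONS = {
    "p.m.", "a.m.", "v.", "vs.", "v.s.", "i.e.", "cf.", "viz.", "e.g.", "p.", "pp.",
}

# Hypothetical word list (assumption): words that often start a sentence.
FREQUENT_SENTENCE_STARTERS = {"The", "He", "She", "It", "But", "In", "This"}

def is_boundary(first, second):
    """first: a non-empty token containing a sentence-ending mark; second: the next token."""
    if first.endswith('"'):
        return True
    if first[0].islower() and first not in NON_FINAL_ABBREVIATIONS:
        return True
    if second in FREQUENT_SENTENCE_STARTERS:
        return True
    return False

print(is_boundary("place.", "The"))  # lower-case word followed by a frequent starter
print(is_boundary("p.m.", "on"))     # listed abbreviation, next word is not a starter
print(is_boundary("Mr.", "Vinken"))  # capitalized first token, next word is not a starter
```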
33Concordance
- KWIC: Key Word In Context
- Display word occurrences and their contexts
- Align all occurrences of a word to make the left and right contexts more visible.
- A concordance is an important tool when building lexicons (dictionaries).
- Example
- Cobuild web site.
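A KWIC display can be sketched in a few lines of Python. This is an illustrative version, not the Cobuild implementation: the function name, the fixed character width, and the case-insensitive match are assumptions.

```python
def kwic(tokens, keyword, width=30):
    """Return one line per occurrence of keyword, with the keyword in a fixed column."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[:i])[-width:]      # last `width` characters of left context
            right = " ".join(tokens[i + 1:])[:width]  # first `width` characters of right context
            lines.append(f"{left:>{width}} {tok} {right}")  # right-align the left context
    return lines

tokens = "it was sheer luck and sheer hope".split()
for line in kwic(tokens, "sheer", width=10):
    print(line)
```

Because the left context is padded to a fixed width, every occurrence of the keyword starts in the same column, which makes patterns to its left and right easy to scan.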
34Concordance Example
        I suspected that, aside from the sheer amount of time Deirdre spent alone
     incapacitating ones, because of the sheer amount of role change required.
<p> The end was not a release - it was a sheer, blissful deliverance. I tumbled
        modifications of slavery itself. Sheer brute force was sufficient to get
    in total darkness but the top of the sheer cliff on the west side was tinged
   cultures of the two continents? Is it sheer coincidence that the poorer parts
    BONE coat. No other can match it for sheer comfort & downright toughness.
    reality, not in dreams. For Day, the sheer concreteness of Thérèse's teachings
     months ago. Sometimes, he says, the sheer contrast in living standards makes
           For all we know, God may take sheer delight in being probed
     <h> Striped Semi-Sheer </h> <p> Not sheer enough to see through…but
       age because it--it was almost all sheer entertainment. And a lovely--
          from leaping overboard through sheer exuberance -- and probably
     where the baby bunting in … <p> The sheer familiarity of that ancient nursery
     trackless scrub and finally stop in sheer grass and sage surrounded by bush.
   filthy rainwater in the gutter. Or on sheer hope and courage, on days when even