Title: move everything that starts with tekstmanip to th
1 ReMa Corpus Linguistics - LTR010M10
- Gertjan van Noord (Alfa-Informatica)
- G.J.M.van.Noord_at_rug.nl
- With thanks to Lonneke van der Plas
2 http://www.let.rug.nl/vannoord/College/REMA/
- This is the website where you can find all the latest
info.
- Schedule.
- Assignments for the practicals will be put there.
- Handouts of the slides of each lecture can be found
there.
3 What you have to do to pass
- Test: UNIX tools (practical session of week 3 or 4)
- Corpus linguistics study (end of course)
- Oral presentation of your study
4 Goal of the course
- The main goal is to teach research master students
what corpora are available and what their nature is,
and to provide you with tools that will help you
to use these corpora for your (future) research.
- You will get some theoretical background, but
only a little.
- I want to spend most of the time and energy on
getting you ready to use UNIX tools and XML
search tools.
5 UNIX
- Unix is an operating system (AT&T, 1969) with plenty
of variants (e.g. Linux).
- Shells
- Commands: cd, ls, pwd, mkdir
6 Why UNIX?
- As part of UNIX you get, for free, many tools that you need
for corpus linguistics, e.g. for counting
word frequencies.
- No need for interfaces (always different ones for
each corpus) with limited functionality.
- We have a large collection of corpora on UNIX
machines.
7 What is a corpus?
- "A collection of linguistic data, either written
texts or a transcription of recorded speech,
which can be used as a starting-point of
linguistic description or as a means of verifying
hypotheses about a language." (David Crystal, A
Dictionary of Linguistics and Phonetics,
Blackwell, 3rd Edition, 1991)
- "A collection of naturally occurring language
text, chosen to characterize a state or variety
of a language." (John Sinclair, Corpus,
Concordance, Collocation, OUP, 1991)
8 What should it look like?
- Sampling and representativeness. The texts in a
corpus must be collected in a systematic way,
under controlled conditions, and in such a way
that the corpus reflects the true distribution of
the language/dialect/variety under study.
- Finite size. In order to allow scientific and
reproducible study, a corpus should be of fixed
and finite size, though that size is often millions or 100s of
millions of words.
- Machine readable. To do anything interesting, a
corpus has to be machine readable and preferably
annotated.
9 Some well-known corpora
- the BROWN corpus
- the LOB corpus
- the BNC corpus
- CHILDES
10 Brown Corpus
- First major computational corpus project, begun by
Francis and Kucera at Brown University
(1961-1964).
- Designed in co-operation with the grammarian Quirk
and others.
- One million words drawn from randomly selected
material written in American English in a variety
of genres.
11 LOB corpus
- The Brown corpus is a snapshot of written American
English in 1961.
- The Lancaster-Oslo/Bergen (LOB) corpus is a
collection of British English sampled in the same
way, under the direction of Geoffrey Leech.
- Allows direct comparison between American and
British dialects.
12 BNC corpus
- 100 million words
- 90% written British English
- Imaginative, natural science, social science,
world affairs, commerce, arts, belief and thought,
leisure, other.
- 10% spoken British English
- Individual interviews, educational, business,
institutional, leisure, other.
13 CHILDES
- The CHILDES database contains transcript and
media data collected from conversations between
young children and their playmates and
caretakers. Conversations with older children and
adults are available from TalkBank. All of the
data is transcribed in CHAT and CA/CHAT formats.
14 Corpora for Dutch
- Eindhoven corpus: written and spoken Dutch
(periods 1964-1971 and 1960-1973); 600,000
written words, 120,000 spoken words
- Corpus Gesproken Nederlands (CGN): spontaneous
conversation, interviews, discussions, etc.;
Flemish and Dutch Dutch; 10 million words
http://lands.let.kun.nl/cgn/ehome.htm
- Twente Nieuws Corpus (TwNC) (release 2002):
newspaper, teletext, subtitles, broadcast news,
etc.; now almost 500 million words
http://www.vf.utwente.nl/druid/TwNC/TwNC-main.html
15 Corpora for Dutch
- SONAR
- >500 million word corpus
- Lassy
- Lassy Small: one million word corpus, fully
annotated syntactically, manually corrected
- Lassy Large: one billion word corpus, fully
annotated syntactically, but not manually
corrected (includes SONAR)
16 What if you want to make your own corpus?
- It's a lot of work!
- Sampling and representativeness: which texts will
you include? You might not want to include some
parts of a text (headers, footers, pictures, etc.).
- Machine readable: documents can be in any format
(PDF, Microsoft Word, HTML). Conversion is not
trivial.
- You might want to annotate it.
17 What can we do with corpora?
- Studies on linguistic structure, such as grammar:
particles and choice of word order in Dutch
verb-final clusters (dat Jan Marie aan zou aan
hebben aan gesproken)
- Lexicography: modern dictionaries are created by
exploring large corpora to determine the frequency of
words (which words should be included?) and the
collocations and phraseology of individual words.
What should the lexical entry for credit include?
credit card, credit service
18 What can we do with corpora? (cont'd)
- Cultural studies: frequency and connotation of a
political ideology
- Translation studies: parallel corpora are more often
used to describe translation options and
preferences
- Forensic linguistics: determine the authorship of
a document by comparing linguistic features in
the disputed document(s), in undisputed documents
and in a general corpus.
19 UNIX (recap)
- Unix is an operating system (AT&T, 1969) with plenty
of variants (e.g. Linux).
- Multi-user: one host, many terminals
- Multiprocessors, timesharing
- Shells
- Standard processes with commands, e.g. cd, ls,
pwd, mkdir
- Standard functions: ^c stops the command,
^d sends end-of-file, ... (^ = CTRL).
20 Shell
- The shell is a computer programme that allows the
user to give commands to the computer
- It takes input (by default from the keyboard)
- And it gives back output (by default to the screen)
22 UNIX file system
- At the top: the root, /
- Path: /home/its/ug1/ee51vn/pics
- ~ is your home directory
- . is the current working directory
- .. is the parent directory of the working directory
- ../.. is the grandparent directory of the working directory
23 About file names
- UNIX is case sensitive
- No extensions (like .exe, .bat, .doc or .rtf). The
period ('.') is just considered to be one
character of the file name => my.first.file.name
- Furthermore, you had better not use the following
characters in your file names: space comma / ( )
' " ? < > \ and avoid using - as the first
character, as these characters have special
meanings.
- So not: My first try at making a corpus study.doc
- Use _ instead of spaces.
24 Unix commands
- Commands are either built-in functions of the
running shell, or program files that you can
simply launch from the shell.
- Their general syntax is
- command [-options] [arguments]
- [ ] means that they are optional.
- You can get information on commands with
- man <command-name>   Example: man cd
25 Some handy commands
- cat > file_name ... ^d: make a file
- cat file_name: show the content of a file
- passwd: change your password
- exit: quit the shell
- cd: change (working) directory. Go home with
'cd' or 'cd ~'
- pwd: print working directory
- who: who is logged in? Also 'whoami' or 'who am i'
26 Some handy commands
- ls: list
- ls lists the current working directory; ls
<directory> lists the given files or directories
- ls -l gives a long list with more info
- ls -R lists the subdirectories recursively
- less: view the content of a file
- tail: view only the last part
- head: view only the first part
27 Some handy commands
- mkdir: make directory
- rm, rmdir: remove files and directories
- (watch out!)
- rm -r removes the subdirectories recursively
- rm -i asks for confirmation
- cp: copy
- mv: move (copy + delete the original)
- renaming a file: mv oldname newname
28 Examples
- cp ../John Jack  =>  a new file is created in the
current working directory, whose name is 'Jack',
and whose content is identical to the content of
the file 'John', which is located in the parent
directory.
- mv Mary ..  =>  the file 'Mary', which is located
in the current working directory, is moved one
level up in the tree structure
- mv ../../Willy .  =>  the file 'Willy' is
moved from the "grandparent directory" to the
current working directory (where I am now)
29 Unix commands
- echo displays a line of text.
- Example: echo hello word  =>  hello word
- expr evaluates a mathematical expression.
- Example: expr 3 + 5  =>  8
30 Wildcards
- Ways to refer to more than one file
- * any sequence of zero or more characters
- ? denotes a single character
- [cset] any single character in the cset
- - range
- ! NOT
31 Examples
- x* any name beginning with 'x'
- x, xold, xerxes
- *x* any name containing an 'x'
- x, xold, fox, maxi, xx
- x? any two-character-long name beginning with
'x'
- xx, xy, x2
32 Wildcards (more)
- ? denotes a single character
- [cset] any single character in the cset
- - range
- ! NOT
- * any sequence of zero or more characters
33 Examples
- x[aeiou] any two-character-long name beginning
with an 'x' followed by a vowel (e.g. xa, xu)
- x[aeiou]* any name beginning with an 'x'
followed by a vowel (e.g. xa, xaver, xerxes)
- x[aeiou]*[abc]*x any name beginning with an 'x'
followed by a vowel, then any characters (or
none), then 'a' or 'b' or 'c', then any
characters (none or one or more), and ending with
an 'x' (e.g. xabx, xanax, xenmnaqwx).
- *.* any name containing a period
34 Wildcards (more)
- * any sequence of zero or more characters
- ? denotes a single character
- [cset] any single character in the cset
- - range
- ! NOT
35 Examples
- [A-Z]* any name beginning with a capital letter
- [1-9] any non-zero digit
- ???? any four-character-long name
- ????* any name at least four characters long
- ???*[0-9.x] any name at least four characters long
that ends with a digit, a period or an 'x'
- [!T]* any name not beginning with a capital
T
36 When do you use them?
- mv tekstmanip* alltekstmanip/
- move everything that starts with tekstmanip to
the alltekstmanip directory
- ls ../*pdf
- list all files in the parent directory
whose names end in pdf
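The wildcard patterns above can be tried out safely in a scratch directory. This is only a small sketch: the file names are made up for the illustration, and `echo` is used so that the shell shows what each pattern expands to before any file is touched.

```shell
# Work in a scratch directory so no real files are affected.
cd "$(mktemp -d)" || exit 1
touch x xold xerxes fox maxi xy Tom tekstmanip1 tekstmanip2
mkdir alltekstmanip

# Move everything that starts with tekstmanip into the directory.
mv tekstmanip* alltekstmanip/

starts_with_x=$(echo x*)        # names beginning with x
contains_x=$(echo *x*)          # names containing an x
two_chars=$(echo x?)            # x followed by exactly one character
x_then_vowel=$(echo x[aeiou]*)  # x, then a vowel, then anything

echo "x*         -> $starts_with_x"
echo "*x*        -> $contains_x"
echo "x?         -> $two_chars"
echo "x[aeiou]*  -> $x_then_vowel"
ls alltekstmanip
```

Running `echo` with a pattern first is a cheap way to check what `mv` or `rm` would act on.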
37 End of week 1. See you at the practicum!
38 metacharacters
- Let's try to calculate: how much is (3 + 4) * 7?
- expr ( 3 + 4 ) * 7 does not work
- The problem is that some characters have special
meanings; those characters are called
metacharacters.
- We have seen so far the special meanings of the
characters '*', '?', '[' and ']'.
- The space also has a special meaning: it is the
delimiter between two arguments of a command.
A file name containing a space will therefore
cause problems under all versions of Unix.
Use _ instead
39 escapes and quotes
- There are two types of neutralizing characters.
- The escape character \ (backslash) neutralizes
the next character
- The two types of quotes ('...' and "...")
neutralize all the metacharacters within them
(simplifying)
- echo what does * mean ?
- echo what does \* mean \?
- echo what does '*' mean '?'
- echo what does "*" mean "?"
40 Examples with expr
- expr 3 \* 7
- expr 3 '*' 7
- expr "3 * 7" wrong! (expr gets one single argument and just prints it)
- expr ( 3 + 4 ) * 7 error!
- expr \( 3 + 4 \) "*" 7
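The working variants from this slide can be collected in a few lines. A minimal sketch; the variable names are only there for the illustration.

```shell
# Every operand and operator must reach expr as a separate argument,
# and metacharacters must be protected from the shell.
sum=$(expr 3 + 5)                  # + is not a metacharacter
product=$(expr 3 \* 7)             # backslash neutralizes the *
quoted=$(expr 3 '*' 7)             # quotes do the same
grouped=$(expr \( 3 + 4 \) \* 7)   # parentheses need escaping too
one_argument=$(expr "3 * 7")       # one argument: printed back unchanged

echo "$sum $product $quoted $grouped"
echo "$one_argument"
```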
41 File filters
- We've seen cat: it prints the content of a file
to the screen
- Instead of just passing the content of a file on,
you can filter it.
- head outputs the first part of a file (first
10 lines by default)
- head -c N prints the first N bytes
- head -n N gives the first N lines
- tail outputs the last part of a file (last 10
lines by default)
- tail -c N prints the last N bytes, etc.
42 More file filters
- rev reverses each line of a file, character by character
- wc prints line, word, and character counts
- sort sorts FILE to standard output
- -r reverse the result of comparisons
- -n sort numerically
- uniq removes adjacent duplicate lines, so it needs a
sorted file
- -c puts a prefix before
each line, giving the number of occurrences
43 Input/output
- standard input is the keyboard (the things you
type)
- standard output is the screen
- Unless you specify standard input/output to be,
for instance, a file
- You do that with > , >> and <
- < the input is taken from the specified
file
- > the output goes to the specified
file (if it exists then it will be overwritten)
- >> the output goes to the specified file
(if it already exists then the output is appended
to its previous content)
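A short sketch of the three redirections, run in a scratch directory. The file name `notes` is just an example.

```shell
cd "$(mktemp -d)" || exit 1

echo "first line"  > notes     # > creates the file (or overwrites it)
echo "second line" >> notes    # >> appends to the existing content
appended=$(wc -l < notes | tr -d ' ')   # < makes the file the standard input

echo "only line" > notes       # > again: the old content is gone
overwritten=$(wc -l < notes | tr -d ' ')

echo "after >>: $appended lines, after >: $overwritten line"
```

The `tr -d ' '` only strips the padding that some versions of `wc` put in front of the number.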
44 Input/output
[Diagram: command 1 with its options (-l or -q), arguments (filename), stdin, stdout, standard error and return value]
45 Input/output
[Diagram: command 1 and command 2, each with options, arguments, stdin, stdout, standard error and return value]
46 Input/output
- If you want to use the output of a command as the
input of another, use a | (pipe)
- cal 2005 | wc
- If you want to use the output of a command as an
argument for another command, use `...` (back
quotation marks)
- example............................
- If you want to combine commands, use (......)
- example...............
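The three ways of connecting commands could look like this. These are illustrations written for this handout, not the examples from the original slide; the file name `letters` is made up.

```shell
cd "$(mktemp -d)" || exit 1
printf 'b\na\nb\n' > letters

# Pipe: the output of one command is the input of the next.
distinct=$(sort letters | uniq | wc -l | tr -d ' ')

# Back quotes: the output of a command becomes part of the arguments.
message="The file has `wc -l < letters | tr -d ' '` lines"

# Parentheses: several commands share one output redirection.
(echo header; cat letters) > with_header

echo "$distinct distinct lines"
echo "$message"
cat with_header
```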
47 tr (translate)
- There is more you can do with files: making
changes in the content with tr.
- tr k Q < test replaces each k in test with a Q
- tr kz KZ < test replaces k with K and z with Z
- Options
- -d deletes all occurrences of the characters in
the set, or of its complement if -c is added as well
- -s squeezes all repetitions of characters in
the set into a single character.
48 Examples with tr
- tr -s ' ' all sequences of spaces are
condensed into a single space
- tr -d -c 0-9 remove all non-numerical
characters (-c: complement of set1)
- A very handy property of tr is that you can refer
to characters by their (octal) ASCII code. We
shall use it very often. The way to do it is
'\XXX', where the quotation marks help to escape
the \ character.
- tr '\012' @ < test
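A few of these tr calls with made-up input, as a sketch. The input comes from `echo` and `printf` instead of the file `test`.

```shell
# Replace characters one for one.
replaced=$(echo "kazoo" | tr 'kz' 'KZ')

# Squeeze runs of spaces into a single space.
squeezed=$(echo "too    many   spaces" | tr -s ' ')

# Delete everything that is NOT a digit (this includes the newline).
digits=$(echo "tel. 050-363 1111" | tr -d -c '0-9')

# \012 is the octal ASCII code of the newline character.
joined=$(printf 'a\nb\nc\n' | tr '\012' '@')

echo "$replaced | $squeezed | $digits | $joined"
```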
49 First real application: a word list
- You can make an alphabetical list or a frequency
list or just a word list.
- Why?
- A word list shows the word usage
- The type/token ratio measures the number of different
words divided by the number of words in total =>
shows diversity
- It will be different from person to person and it
also differs according to age
50 How to make an alphabetical word list?
- change all capitals to lower-case
- every word should be on one line
- sort alphabetically
- remove doubles
51 How to make an alphabetical word list?
- change all capitals to lower-case
- tr 'A-Z' 'a-z' < test
- every word should be on one line and squeeze (-s)
repeats of a space
- tr -s ' \011\012' '\012' < test
- sorting: sort test
- remove doubles: uniq
52 How to make a frequency word list?
- We can put all this in one command by using pipes
- cat test | tr 'A-Z' 'a-z' | \
tr -s ' \011\012' '\012' | sort | uniq
- if you want frequencies
- uniq -c (count)
- Then if you want to sort according to frequency
- sort -n (numerical)
- And if you want the highest frequency on top
- sort -nr (numerical and reverse)
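Put together, the whole frequency list is one pipeline. A sketch on a tiny made-up text without punctuation; the output file name `freq_list` is an assumption.

```shell
cd "$(mktemp -d)" || exit 1
printf 'the cat saw The dog\nthe dog saw the bird\n' > test

# lowercase | one word per line | sort | count | highest count first
cat test | tr 'A-Z' 'a-z' | \
    tr -s ' \011\012' '\012' | sort | uniq -c | sort -nr > freq_list

cat freq_list
top=$(head -n 1 freq_list)
```

The sort before `uniq -c` is what brings identical words next to each other so that they can be counted.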
53 Zipf's law
54 Zipf's Law
- Zipf's law: the observation of Harvard linguist
George Kingsley Zipf that the frequency of use of
the nth-most-frequently-used word in any natural
language is approximately inversely proportional
to n.
- So, the second most common word will occur
1/2 as often as the first. The third most common
word will occur 1/3 as often as the first.
- There are only very few words that occur often.
There are very many words that occur infrequently.
55 Regular expressions (grep)
- We've seen wildcards
- Regular expressions build on the same idea, but
with important differences.
- To show how regular expressions work:
- grep <reg_exp> <file>: it will return the lines of the
given file that match the regular expression.
- grep a test: it will return the lines from
file test that contain an 'a'
56 Regular expressions (grep)
- Options for grep
- -c returns the number of lines matching the
given regular expression
- -i ignore case: does not
differentiate between capital and lowercase
letters
- -v inverse: returns those lines that don't
match the condition
57 Regular expressions
- the dot . matches any character (as ? in
wildcards)
- grep a..le matches lines with apple etc.
- [ ] any character within the brackets
- grep [bf]all
- => will match fall and ball
- [a-z] interval of characters: grep [a-z] matches
any line that contains a lowercase letter.
- [^ ] complement of the listed characters
(anything except those): grep [^ab] filename
- matches any line that contains at least one
character other than a or b
58 Regular expressions
- Repetition of some sets
- * Kleene-star (Kleene closure): the
repetition of the expression before it, any number of times
(even 0 times): a* matches '', a, aaaa
- grep b*a => matches lines with bbbbba, ba,
bbbbbbbbbbba etc.
- grep l.*b => matches lines that have a sequence
that starts with an 'l', then a sequence of any
characters, and then a 'b'
59 Regular expressions
- Position within the line
- ^ beginning of the line (only at the
beginning, otherwise it matches itself)
- Example: grep ^[aeiou] filename
- $ end of the line (at the end of the
outermost expression, otherwise it matches
itself)
- Example: grep [aeiou]$ filename
60 More examples
- [oai]n either 'on' or 'an' or 'in'
- [0-9][0-9] two consecutive digits
- ^[aeiou] a vowel at the beginning of the line
- ^.[aeiou] a vowel at the second position of a line
- ^[aeiou]$ a line consisting exactly of a vowel
- [^0-9] anything but a digit
- [^0-9]$ a line not ending with a digit
- ^[d\-] a line beginning with a 'd' or a '-'
- abb* an 'a', followed by one or more bs
- [0-9][0-9]* one or more digits
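Some of these expressions can be checked against a small test file. This is a sketch with made-up lines; the patterns are put in single quotes so that the shell does not treat them as wildcards first.

```shell
cd "$(mktemp -d)" || exit 1
printf 'apple\nball\nfall\n42\ndog-7\n-dash\ne\n' > test

grep 'a..le' test                              # prints: apple
ball_or_fall=$(grep -c '[bf]all' test)         # lines with ball or fall
vowel_first=$(grep -c '^[aeiou]' test)         # lines starting with a vowel
only_a_vowel=$(grep -c '^[aeiou]$' test)       # lines that are exactly one vowel
two_digits=$(grep -c '[0-9][0-9]' test)        # lines with two consecutive digits
no_digit_at_end=$(grep -c '[^0-9]$' test)      # lines ending in a non-digit
without_digits=$(grep -v -c '[0-9]' test)      # -v: lines with no digit at all

echo "$ball_or_fall $vowel_first $only_a_vowel $two_digits $no_digit_at_end $without_digits"
```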
61 End of week 2. See you at the practicum!
62 Frequencies
- We have made word lists in the practicum.
- What if we want a frequency list?
- The same, but give uniq the -c option (for
counts).
- What you get is a list of words with the
frequency attached to it. That is the absolute
frequency.
- You can get the relative frequency of a word by
dividing the absolute frequency by the total
number of words in the corpus under
consideration. The frequency is then relative to
the corpus size.
63 Type/token ratio (TTR)
- The type/token ratio gives an idea of the
richness in vocabulary.
- How many words does this text have?
- Do you mean tokens or types?
- Number of tokens = size of the corpus
- Number of types = size of the vocabulary of the
corpus
- type/token ratio = number of types divided by
number of tokens
64 Type/token ratio (TTR) (cont'd)
- TTR is very different for texts of different
length.
- Longer texts tend to have lower TTR values. Why?
- If you are working with texts of different
length, you have to use some form of
normalization.
- For example, split your texts up into texts of 1000
words and calculate the average TTR.
- Or compare TTR versus text length
65 How to compute type/token ratio?
- Tokens
- tr 'A-Z' 'a-z'
- tr -d '[:punct:]'
- tr -s ' \012' '\012' | wc -l
- Types
- tr 'A-Z' 'a-z'
- tr -d '[:punct:]'
- tr -s ' \012' '\012'
- sort | uniq | wc -l
- NB: to count tokens, what if you use wc -w?
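The two counts and the ratio itself, as a sketch on a tiny made-up text. The final division uses awk, which is not one of the tools from these slides; it is used here only because expr handles integers only.

```shell
cd "$(mktemp -d)" || exit 1
printf 'The cat saw the dog.\nThe dog saw the cat!\n' > text

# One lowercased token per line, punctuation removed.
tr 'A-Z' 'a-z' < text | tr -d '[:punct:]' | tr -s ' \012' '\012' > tokens

n_tokens=$(wc -l < tokens | tr -d ' ')
n_types=$(sort tokens | uniq | wc -l | tr -d ' ')

# Assumption: awk is available for the non-integer division.
ttr=$(awk -v t="$n_types" -v n="$n_tokens" 'BEGIN { printf "%.2f", t / n }')

echo "tokens: $n_tokens  types: $n_types  TTR: $ttr"
```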
66 Sed
- sed is a stream editor. A stream editor is
used to perform basic text transformations on an
input stream.
- You might think: hey, didn't we see that already
with tr? Differences:
- sed works on string-level (regexps),
- tr works on character-level.
67 Sed
- sed 's/regex/newstring/'
- rewrite the first string on each line that matches the given
regular expression.
- sed 's/regex/newstring/g'
- replace all instances of regex ('g' for global).
68 Sed
- sed '/regex/d'
- This rule will delete all lines that match
regex
- sed 's/regex//'
- replaces the match with the empty string.
- The & stands for the string that has been
matched.
- echo "ik ben jan" | sed 's/jan/&-willem/'
- escape metacharacters when necessary
- sed 's/\//a/' => will replace / with a
69 Sed
- If you want to put more than one operation on
the command line (or you have a script file plus
at least one operation on the command line), use
the -e option before each operation on the
command line
- sed -e '/Henry/d' -e 's/Smith/White/g'
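The sed commands from the last slides, applied to short made-up input lines as a sketch:

```shell
first_only=$(echo "jan en jan" | sed 's/jan/piet/')    # first match on the line
every_one=$(echo "jan en jan" | sed 's/jan/piet/g')    # g: all matches
with_amp=$(echo "ik ben jan" | sed 's/jan/&-willem/')  # & = the matched string
slash=$(echo "a/b" | sed 's/\//-/')                    # escaped / in the regex

# Two operations: delete lines with Henry, rename Smith in the rest.
two_ops=$(printf 'Henry Smith\nJohn Smith\n' | sed -e '/Henry/d' -e 's/Smith/White/g')

echo "$first_only | $every_one | $with_amp | $slash | $two_ops"
```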
70 selecting columns (cut)
- 0513678 John   8
  0612942 Kathy  7
  0418365 Pieter 6
  0539482 Judith 9
- If you want to remove the names
- cut -c1-8,16-18 grades
71 columns with delimiter
- 0513678 John 8
  0612942 Kathy 7
  0418365 Pieter 6
  0539482 Judith 9
- cut -d <delimiter> -f 2
- The -d option of the command cut defines what the
delimiter is, and you can refer to a field by
giving its number after the option -f.
72 merging columns (paste)
[Diagram: cat versus paste on several text files]
- paste -d char file file...
- Now how can you change the order of columns of a
given file 'info'? By combining cut and paste
- cut -c1-8 info > name
- cut -c9-14 info > birthdate
- cut -c15-40 info > address
- paste address birthdate name > new_info
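A sketch of cut and paste on the grades table. The fixed-width layout (name in columns 9-15) and the colon delimiter of the second file are assumptions made for this example.

```shell
cd "$(mktemp -d)" || exit 1

# Fixed-width columns: number in 1-8, name in 9-15, grade from 16.
printf '0513678 John   8\n0612942 Kathy  7\n' > grades
without_names=$(cut -c1-8,16-18 grades)

# Delimited columns: here a colon separates the fields.
printf '0513678:John:8\n0612942:Kathy:7\n' > grades2
names=$(cut -d ':' -f 2 grades2)

# Changing the column order: cut the columns apart, paste them back.
cut -d ':' -f 1 grades2 > numbers
cut -d ':' -f 2 grades2 > names_only
reordered=$(paste -d ':' names_only numbers)

echo "$without_names"
echo "$names"
echo "$reordered"
```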
73 Second application: N-grams
- An N-gram is like a moving window over a text,
where N is the number of elements in the window
(bigrams, trigrams etc).
- Why do we need N-grams?
- There is more information in combinations of two
or more things than in single elements.
- Language guesser: a lot of languages share the
same alphabet, but certain combinations are typical
for one language (aa for NL, sh for EN).
=> http://www.let.rug.nl/vannoord/TextCat/Demo/
- text classification
74 Making a bigram at word level
- Make a list of words starting from position 1
- Make a list of words starting from position 2
- paste these two lists
- Making a list with tr
- tr 'A-Z' 'a-z' | tr -s ' ' '\012' > list
- Making a list starting at position 2
- tail -n +2 < list > listplus2
- paste list listplus2 | sort | uniq -c | sort -nr
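The steps above can be run on a small made-up text like this. A sketch: the output file name `bigrams` is an assumption.

```shell
cd "$(mktemp -d)" || exit 1
printf 'the cat saw the dog\nthe cat sat\n' > text

# Words from position 1 and words from position 2, one per line.
tr 'A-Z' 'a-z' < text | tr -s ' ' '\012' > list
tail -n +2 < list > listplus2        # +2: start at the second line

# Side by side they form the bigrams; count them, most frequent first.
paste list listplus2 | sort | uniq -c | sort -nr > bigrams

cat bigrams
top=$(head -n 1 bigrams)
```

The last word has no right-hand neighbour, so the list ends with one incomplete pair.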
75 Punctuation and bigrams on word level
- Remove punctuation?
- John put on his hat and slept. Penguins were
walking.
- 'John put', 'put on', 'on his', 'his hat' are all
combinations that can be found in English.
- 'slept penguins' is not a common combination. It is
better to keep the line breaks / punctuation...
76 Collocation
- "Within the area of corpus linguistics,
collocation is defined as a pair of words (the
'node' and the 'collocate') which co-occur more
often than would be expected by chance." (from
Wikipedia)
- "A collocation is any turn of phrase or accepted
usage where somehow the whole is perceived to
have an existence beyond the sum of the parts."
(from Manning and Schütze, 1999)
- Hard to determine the boundary between
collocations and other frequent word combinations
(co-occurrences).
77Examples of collocations
- Middle East
- President Bush
- real estate
- brute force
- take pictures (not make pictures)
- do a favour (not make a favour)
78 Collocation of more than two words
- the term 'collocation' is also used for > 2
words
- (he) kicked the bucket
- (hij heeft) de pijp aan Maarten gegeven
- (hij heeft zonder commentaar) het veld geruimd
79 Collocations (common features)
- Non-compositionality: the meaning of a
collocation is not a straightforward composition
of the meaning of its parts. For example, the
meaning of kick the bucket has nothing to do with
kicking buckets (it means 'to die').
- Non-substitutability: we cannot substitute a word
in a collocation with a related word. For
example, we cannot say yellow wine instead of
white wine although both yellow and white are the
names of colors.
- Limited modifiability: adding modifiers or
syntactic transformations is not always possible.
John kicked the green bucket or the bucket was
kicked has nothing to do with dying.
80 Collocation check
- A trick that often works is to translate a
combination of words literally (word by word) into
another language and see if that works.
81 Collocations
- words in a collocation do not have to be adjacent
- She is under a lot of pressure
- They made it up to him
- Hij gaat problemen altijd uit de weg
- Special types: institutionalised phrases (phrases
that are syntactically/semantically compositional
but whose co-occurrence is conventionalised)
- strong tea vs. powerful tea
- powerful computer vs. strong computer
82How to find collocations automatically?
- Using n-gram frequency counts?
- We learned how to make n-grams
- Perhaps the most frequent n-grams are
collocations?
83 Problems with n-gram frequencies
- We get many frequent, but uninteresting
combinations (from Federalist papers)
- 4021 of the
- 1494 to the
- 1440 in the
- 1174 Z the
- 872 to be
- 676 that the
- 612 by the
- 608 it is
84 Remedy: use linguistic info
- If we have a corpus that has part-of-speech tags,
- we can look only for the combination of an adjective and a
noun, or noun + noun, for English
- Justeson and Katz (1995) have applied a
part-of-speech filter to identify likely
collocations.
- Results are surprisingly good for such a simple
method.
- First 100 bigrams of Federalist papers, checked
on NN or AN
- 205 new york
- 193 united states
85 How to do it using statistics (t test)
- What we want to know is whether two words occur
together more often than expected by chance.
- 'on' occurs frequently, 'the' occurs frequently.
- 'on the' occurs frequently.
- The question is: does it occur more frequently
than expected by chance?
- We can use the t test to determine that
86 Other methods for hypothesis testing
- Pearson's chi-square test
- Does not assume normal distribution, unlike the t
test.
- Can be used to determine corpus similarity.
- Likelihood ratios
- More appropriate for sparse data.
- Easier to interpret: it gives a number that says
how likely one hypothesis is over the other.
- More info in Manning and Schütze, 1999, chapter 5
87 Relative frequency ratios
- We already learned what relative frequencies are.
- We calculate the ratio of the relative frequency
of some bigram/word in two different corpora.
- That way we can see, for example, what celebrity
was particularly hot in 2007 compared to 2006.
88 Another statistical test: point-wise mutual
information
- Compare the frequency of the bigram with the
expected frequency, assuming the elements of the
bigram are independent
- The usual definition is in terms of
probabilities; we will use relative frequencies
instead
89 Mutual information
- Sometimes also used to find out whether words like to
co-occur
- f(w1): how many sentences contain w1
- f(w1,w2): how many sentences contain both w1 and
w2
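For reference, the usual textbook definition of point-wise mutual information can be rewritten with relative frequencies as follows. This is a sketch of the standard formula, not a copy of the slide; the symbol N (the total number of sentences in the corpus) is an assumption introduced here.

```latex
I(w_1, w_2)
  = \log_2 \frac{P(w_1, w_2)}{P(w_1)\, P(w_2)}
  \approx \log_2 \frac{f(w_1, w_2) / N}{\bigl(f(w_1)/N\bigr)\,\bigl(f(w_2)/N\bigr)}
  = \log_2 \frac{N \, f(w_1, w_2)}{f(w_1)\, f(w_2)}
```

For example, with N = 1000, f(w1) = 10, f(w2) = 20 and f(w1,w2) = 5 this gives log2(5000/200) = log2(25), about 4.64. If the two words were independent, the ratio would be close to 1 and the score close to 0.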
90 Making a bigram at character level
- Now you want a list of characters. What you do
is:
- lowercase all letters
- tr 'A-Z' 'a-z' < filename
- replace each character with itself + a newline
- sed 's/./&\n/g' filename > list
- The rest is the same as for the bigram at word
level.
91 Making a bigram at character level (cont'd)
- But there are many empty lines. So better:
- tr 'A-Z' 'a-z'
- tr -d '\012'
- tr -s ' '
- sed 's/./&\n/g'
- First delete all newlines, then squeeze all
repeats of a space.
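The complete character-bigram pipeline on a very short made-up text. A sketch that assumes GNU sed, where \n in the replacement stands for a newline; other versions of sed may need a different way to write the newline.

```shell
cd "$(mktemp -d)" || exit 1
printf 'Ab ab\n' > text

# lowercase | drop newlines | squeeze spaces | one character per line
tr 'A-Z' 'a-z' < text | tr -d '\012' | tr -s ' ' | sed 's/./&\n/g' > list

# From here on it is the same as for word bigrams.
tail -n +2 < list > listplus2
paste list listplus2 | sort | uniq -c | sort -nr > charbigrams

cat charbigrams
top=$(head -n 1 charbigrams | tr -d ' \t')
```

Note that the space is kept as a character, so bigrams such as 'b' + space mark word boundaries.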
92 Concordance
- "A concordance is an alphabetical list of the
principal words used in a book or body of work,
with their immediate contexts." (From Wikipedia,
the free encyclopedia)
- historical concordances (manually built, years of
work!)
- The Bible
- The Quran
- The works of Shakespeare
93 Cruden's Concordance (1736) of the Bible
- dry ground
- behold the face of the ground was d. Gen 8:13
- Israel shall go on d. ground in the sea Ex 14:16,
22
- stood firm on d. ground in the sea Josh 3:17
- Elijah and Elisha went over on d. ground 2 Ki 2:8
- he turneth water-springs into d. ground Ps 107:33
- he turneth d. ground into water-springs 35
- I will pour floods upon the d. ground Isa 44:3
- He shall grow as a root out of a d. ground 53:2
- She is planted in a d. and thirsty ground Ezek
19:13
94 Keyword in Context (KWIC)
- Volkskrant 97
- terwijl het nettoloon op peil blijft .
- en eigenliefde scoren beneden peil .
- club en het internationale peil .
- aandelenkoersen op het laagste peil van de
afgelopen drie
- blijft op het huidige peil van
zestigduizend gulden ,
- om de accommodatie op peil te brengen .
- Can be done automatically, with the UNIX
utilities we have learned. We will do that in the
practicum.
95 Why are concordances/KWIC so interesting?
- language use in context
- word senses defined by their surrounding context
- qualitative analyses of concordance lines
- frequencies can be calculated from concordances
- intuitions/hypotheses can be validated
- new hypotheses can be formulated from
concordances
- lexicographers love KWIC tools!
96 Third application: KWIC
- KWIC (Keyword in context): a table with the left
and right context of a word
- KWICs are handy, for example, for translators.
- Example
- 'verband' would translate to bandage, but if we
look in context
- 'In verband met' we don't want to translate with
in bandage with
97 How do we make a KWIC?
- grep to select lines (this already is a KWIC in
fact)
- cut to select context
- sed + cut to give the contexts all the same length
- paste to combine contexts
98 KWIC - the commands
- grep itself test | sed -e 's/[Ee]en/<delimiter>&<delimiter>/' > lines
- cut -d <delimiter> -f 1 < lines > before
- cut -d <delimiter> -f 2 < lines > itself
- cut -d <delimiter> -f 3 < lines > after
- sed -e 's/^/                              /' <
before > before2
- sed -e 's/$/                              /' <
after > after2
- cut -c 1-30 < after2 > after3
- rev before2 | cut -c 1-30 | rev > before3
- paste before3 itself after3
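A runnable sketch of these KWIC steps on three made-up lines. Two assumptions: '#' is used as the delimiter (any character that does not occur in the text will do), and grep looks for the same expression, [Ee]en, that sed marks. The 30 spaces are generated with printf instead of being typed out.

```shell
cd "$(mktemp -d)" || exit 1
printf 'dit is een test\nniets hier\nEen tweede regel\n' > test

# Select the lines and put the delimiter around the keyword.
grep '[Ee]en' test | sed -e 's/[Ee]en/#&#/' > lines
cut -d '#' -f 1 < lines > before
cut -d '#' -f 2 < lines > itself
cut -d '#' -f 3 < lines > after

# Pad both contexts with 30 spaces, then trim them to 30 characters.
pad=$(printf '%30s' '')
sed -e "s/^/$pad/" < before > before2
sed -e "s/\$/$pad/" < after > after2
cut -c 1-30 < after2 > after3
rev before2 | cut -c 1-30 | rev > before3   # keep the LAST 30 characters

paste before3 itself after3
```

The double rev is the trick that right-aligns the left context: cut can only keep characters counted from the start of a line.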
99 The end of week 3, see you at the test
100 Reading
- Read the following 4 pages BEFORE tomorrow's
practicum
- p162-p166, section 5.3 until 5.3.2
- Manning, C. and Schütze, H. 1999. Foundations of
Statistical Natural Language Processing. Ch 5
- Available from
- http://nlp.stanford.edu/fsnlp/promo/colloc.pdf
101 References
- Justeson, J. S. and Katz, S. M. 1995. Technical
terminology: some linguistic properties and an
algorithm for identification in text. Natural
Language Engineering 1: 9-27
- Manning, C. and Schütze, H. 1999. Foundations of
Statistical Natural Language Processing. Ch 5
- Available from
- http://nlp.stanford.edu/fsnlp/promo/colloc.pdf
102 variables
- list of variables by typing set
- better: set | grep PATH
- examples of variables
- PATH: a set of paths that are checked when you
give a command
- HOME: the path to the home directory of the user
- PWD: the working directory
- You can set variables by typing
SOME_VARIABLE=some_value
- to let the shell know you are talking about the
variable: echo $SOME_VARIABLE
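A minimal sketch of setting and reading a variable:

```shell
SOME_VARIABLE=some_value          # no spaces around the = sign
with_dollar=$(echo $SOME_VARIABLE)    # the shell substitutes the value
without_dollar=$(echo SOME_VARIABLE)  # without $ it is just a word

echo "$with_dollar / $without_dollar"

# Pick one variable out of the long list that set prints.
path_line=$(set | grep '^PATH=')
echo "$path_line"
```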
103 Annotations
- To annotate: add extra data to the corpus
- Extra-textual information / metadata (title,
author, date of creation)
- Linguistic information
- word class (part-of-speech) (noun/verb/adj/...)
- lemma
- syntactic info
- semantic
- phonetic
- ...
104 SUSANNE corpus
                POS     wordForm  Lemma
N12:0510g  -    PPHS1m  He        he
N12:0510h  -    VVDv    studied   study
N12:0510i  -    AT      the       the
N12:0510j  -    NN1c    problem   problem
N12:0510k  -    IF      for       for
N12:0510m  -    DD221   a         a
N12:0510n  -    DD222   few       few
N12:0510p  -    NNT2    seconds   second
N12:0520a  -    CC      and       and
N12:0520b  -    VVDv    thought   think
....
105 Syntactic annotation
[S [NP Claudia NP] [VP sat
[PP on [NP a stone NP]
PP] VP] S]
106 Syntactic annotation in Alpino
<node rel="whd" index="1" frame="wh_tmp_adverb"
pos="adv" begin="0" end="1" root="wanneer"
word="Wanneer" wh="ywh" special="tmp"/>
<node rel="body" cat="sv1" begin="0" end="6">
<node rel="mod" index="1"/>
<node rel="hd"
frame="verb(hebben,past(sg),part_intransitive(plaats))"
pos="verb" begin="1" end="2"
root="vind_plaats" word="vond"
sc="part_intransitive(plaats)" infl="past(sg)"/>
<node rel="su" cat="np" begin="2" end="5">
<node rel="det" frame="determiner(de)"
pos="det" begin="2" end="3" root="de"
word="de" infl="de"/>
<node rel="mod"
frame="adjective(e,adv)" pos="adj"
begin="3" end="4" root="Duits" word="Duitse"
infl="e"/>
<node rel="hd"
frame="noun(de,count,sg)" pos="noun"
begin="4" end="5" root="hereniging"
word="hereniging" gen="de" num="sg"/>
</node>
<node rel="svp" frame="particle(plaats)"
pos="part" begin="5" end="6"
root="plaats" word="plaats"/>
</node>
</node>
107 XML
- XML: EXtensible Markup Language
- XML is a markup language much like HTML
- XML tags are not predefined. You must define your
own tags
- XML was created to structure, store and send
information
108 Shell scripts
- Wouldn't it be nice to save our often long
commands in files?
- cat > a_simple_shell_script
- echo Now I will list the subdirectories of the
directories whose name contains exactly 4
characters.
- ls -l ???? | grep ^d
- echo Thank you for waiting.
- echo What about an alphabetical order of these?
- ls -l ???? | grep ^d | sort
- echo Here you have it.
- ctrl-d
109 Links
- pointing to the same file from different places
and/or with different names
- ln <existing-file> <new_name>
- In the long listing (ls -l) you can see the number of hard links
- For the difference between hard and soft (-s) links,
see Tamas files
110 Protocols
- Imagine you are at home and you want to log in to
one of the computers in the UNIX room.
- Protocols are standards of communication between
different systems that may be far away from
each other and operate in different ways
- telnet or ssh (Secure Shell Protocol), but no
graphical interface
- ftp (File Transfer Protocol) to transfer files
from one machine to another.
111shell scripts
- To let the shell know that this is a program that we want to execute, make the file executable
- chmod +x
- Imagine you are going to write lots of scripts; then you want to store them in one directory
- To let the system know where to look for this command when you type it, add that directory to your PATH variable
- PATH=$PATH:$HOME/shellscripts
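The two steps (make the file executable, then extend PATH) can be sketched as follows. The script name say_hello is hypothetical, and the sketch uses a scratch directory instead of $HOME/shellscripts so that it does not touch your home directory.

```shell
# Hypothetical directory and script names, created in a scratch location.
scriptdir=$(mktemp -d)/shellscripts
mkdir -p "$scriptdir"

# Write a tiny script into that directory.
cat > "$scriptdir/say_hello" <<'EOF'
#!/bin/sh
echo "hello from a script"
EOF

chmod +x "$scriptdir/say_hello"   # make the file executable
PATH=$PATH:$scriptdir             # let the shell find it by name
export PATH

# The script can now be called like any other command.
greeting=$(say_hello)
echo "$greeting"
```

The change to PATH only lasts for the current shell session; to keep it, the assignment is usually placed in a shell start-up file.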
112shell scripts
- Now what if you want to give arguments? You don't always want to look for directories with 4 characters, sometimes 3, and you don't want to change your program all the time.
- Refer with $1, $2 to the first and second word after the script's name.
- ls -l ???? | grep '^d' becomes ls -l $1 | grep '^d'
- and you type in
- a_simple_shell_script '???'
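A small sketch of a script that takes the pattern as its first argument. The script name list_subdirs and the test directories are invented for the example, and the grep pattern is anchored with ^ on the assumption that only lines starting with d (directories) are wanted.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir abc abcd abc/sub3 abcd/sub4     # hypothetical test directories

cat > list_subdirs <<'EOF'
#!/bin/sh
# $1 is the first word after the script name, e.g. '???'
# It is left unquoted here so that the shell expands the pattern.
ls -l $1 | grep '^d'
EOF
chmod +x list_subdirs

three=$(./list_subdirs '???')         # directories with 3-character names
four=$(./list_subdirs '????')         # directories with 4-character names
echo "$three"
echo "$four"

cd / && rm -rf "$workdir"
```

The quotes around '???' on the command line matter: without them the calling shell would expand the pattern before the script ever sees it.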
113shell scripts
- control structures
- case
- if
- for
- while
- until
114shell scripts
- case <selector> in
- <value1> ) <commands1> ;;
- <value2> ) <commands2> ;;
- <value3> ) <commands3> ;;
- ...
- <valueN> ) <commandsN> ;;
- esac
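As an illustration of the case construct, here is a minimal sketch of a function that guesses a file type from its extension. The function name file_kind and the categories are made up for the example.

```shell
# Hypothetical helper: guess a file type from its extension.
file_kind() {
    case "$1" in
        *.txt)        echo "plain text" ;;
        *.xml)        echo "XML" ;;
        *.gz | *.zip) echo "compressed" ;;   # several patterns for one branch
        *)            echo "unknown" ;;      # default branch
    esac
}

file_kind corpus.xml
file_kind notes.txt
file_kind archive.gz
file_kind README
```

The values are shell patterns, so wildcards such as * can be used, and the first matching branch is the one that runs.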
115if
- if <commands1> ; then <commands2> ; else <commands3> ; fi
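The condition of an if is itself a command: the then branch runs when that command succeeds. A minimal sketch, assuming a hypothetical helper contains_word and a one-line sample file:

```shell
# Hypothetical helper: report whether a string occurs in a file.
contains_word() {
    # grep -q prints nothing; it only succeeds or fails.
    if grep -q "$1" "$2"; then
        echo "found"
    else
        echo "not found"
    fi
}

sample=$(mktemp)
echo "Claudia sat on a stone" > "$sample"

hit=$(contains_word stone "$sample")
miss=$(contains_word river "$sample")
echo "stone: $hit"
echo "river: $miss"

rm -f "$sample"
```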
116for, while, until
- for <variable> in <list> ; do <command command ...> ; done
- while <command command ...> ; do <command command ...> ; done
- until <command command ...> ; do <command command ...> ; done
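The three loops can be compared side by side in a short sketch. The variable names are arbitrary; each loop collects its output in a string so the result is easy to inspect.

```shell
# for: iterate over a list of words
for_out=""
for w in the cat sat; do
    for_out="$for_out$w "
done

# while: repeat as long as the test command succeeds
i=1
while_out=""
while [ "$i" -le 3 ]; do
    while_out="$while_out$i "
    i=$((i + 1))
done

# until: repeat as long as the test command fails
j=3
until_out=""
until [ "$j" -eq 0 ]; do
    until_out="$until_out$j "
    j=$((j - 1))
done

echo "$for_out"
echo "$while_out"
echo "$until_out"
```

while and until differ only in the sense of the test: while stops when the command fails, until stops when it succeeds.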
117How to kill a process
- Imagine you are running a shell script but it goes into an infinite loop and you want to stop it.
- type ps (it will show you all your processes)
- check the process ID of the process to be killed
- kill -9 <process_id>
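The steps can be tried safely on a harmless background process. In this sketch, sleep 300 stands in for the runaway script; trying a plain kill before kill -9 is a common habit added here, not something the slide prescribes.

```shell
sleep 300 &                         # stand-in for a runaway script
pid=$!                              # process ID of the last background job
ps -p "$pid"                        # confirm that it is running

kill "$pid"                         # polite request first (SIGTERM)
wait "$pid" 2>/dev/null || true     # collect the terminated job

# kill -0 sends no signal; it only checks whether the process still exists.
if kill -0 "$pid" 2>/dev/null; then
    kill -9 "$pid"                  # last resort: SIGKILL cannot be ignored
    status="killed with -9"
else
    status="terminated"
fi
echo "$status"
```

kill -9 gives the process no chance to clean up, which is why it is usually kept for processes that do not react to a plain kill.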
118N-gram-based text categorization
- Based on the N-grams found in a text, a company can automatically categorize texts
- by subject (what is it about), comparing the most typical words
- or by language (where is it from), using N-grams on character level.
- Article about this on website under 'Literature'
- William B. Cavnar and John M. Trenkle, N-Gram-Based Text Categorization
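Character N-gram frequencies of the kind used for language identification can be counted with standard UNIX tools. The function char_ngrams below is a hypothetical sketch, not the method of the article itself: it only builds a frequency-ranked N-gram list from standard input and does not compare profiles between texts.

```shell
# Hypothetical helper: print character n-grams of the input, most frequent first.
# Usage: char_ngrams N < text
char_ngrams() {
    n="$1"
    tr 'A-Z' 'a-z' |                  # ignore case
    tr ' ' '_' |                      # make spaces visible as underscores
    awk -v n="$n" '{
        # slide a window of n characters over each line
        for (i = 1; i + n - 1 <= length($0); i++) print substr($0, i, n)
    }' |
    sort | uniq -c |                  # count identical n-grams
    awk '{ print $1, $2 }' |          # normalise to "count ngram"
    sort -k1,1nr -k2,2                # by count (descending), then alphabetically
}

printf 'banana\n' | char_ngrams 2
```

For the input banana the bigrams an and na each occur twice and ba once, so the list starts with the line "2 an".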
119Permissions ls -l
- very first character: '-' for simple file, 'd' for directory, 'l' for symbolic link, etc.
- 3 times 3 characters: permissions for user, group and others (r = permission to read, w = permission to write, x = permission to execute)
- number of links belonging to this file
- user and the group owning the file
- size of the file in characters ('total' on the top: total number of disk blocks occupied by the listed files)
- Date and time when the file was last modified
120Permissions (chmod)
- chmod: changing the permissions of a file.
- u = user, g = group, o = others, a = all (also ug, uo etc.)
- + to give permission, - to remove permission; r to read, w to write, x to execute
- chmod g+w filename or chmod o-x filename
- 4 = permission to read, 2 = permission to write, 1 = permission to execute
- chmod 753 filename
- rwx to owner, r-x to group and -wx to others.
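Both notations can be checked on a scratch file. This is a minimal sketch, assuming a temporary file made with mktemp; cut -c1-10 keeps only the file type and permission characters of the ls -l output.

```shell
f=$(mktemp)

# Numeric notation: 7 = 4+2+1 (rwx), 5 = 4+1 (r-x), 3 = 2+1 (-wx)
chmod 753 "$f"
mode_octal=$(ls -l "$f" | cut -c1-10)
echo "$mode_octal"

# Symbolic notation, starting from the permissions set above
chmod g+w "$f"                      # group: r-x becomes rwx
chmod o-x "$f"                      # others: -wx becomes -w-
mode_symbolic=$(ls -l "$f" | cut -c1-10)
echo "$mode_symbolic"

rm -f "$f"
```

The numeric form sets all nine permission bits at once, while the symbolic form changes only the bits that are named.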
121Coding different alphabets
- Three levels
- keyboard layout: which key gives which character (QWERTY, QWERTZ, AZERTY)
- code page: a number (signal from keyboard) is translated to a character (on your screen)
- font: exact graphical image (e.g. Times New Roman)
- Be aware of the behavior of intervening programs (shell, editor, xterm, less, cat)
122Character encoding
- How to store characters in bits and bytes.
- Each character has a unique code.
- 1 bit has two states (0/1).
- 8 bits (1 byte) can store 256 values.
- 16 bits (2 bytes) can store 65536 values.
- Different languages require different encoding tables depending on the alphabet. For English (a language without diacritics), all characters needed fit in just one byte.
123Character encoding, historical perspective
- At first only the English alphabet: 7 bits was enough (ASCII)
- Most computers used 8-bit bytes.
- With the last bit, people in different parts of the world did different things (codes 128-255).
- On some PCs the character code 130 would display as é, but on computers sold in Israel it was the Hebrew letter Gimel (ג)
- ANSI standard: everyone agreed on what to do with the codes below 128; the rest depended on where you lived.
124Character encoding, historical perspective
- Then came Unicode.
- All languages in one system, so it is possible to have a document with different languages.
- UTF-8 is a system for storing your string of Unicode code points in memory. In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3 or 4 bytes (the original design allowed up to 6).
- English text looks the same in UTF-8 as it does in ASCII.
- Latin-1 is useful for any Western European language, but not for Russian or Hebrew.
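The variable length of UTF-8 can be seen by counting bytes. In this sketch the characters are written as octal byte escapes so that the result does not depend on the encoding of the script file; byte_count is a hypothetical helper name.

```shell
# Hypothetical helper: number of bytes produced by a printf format string.
byte_count() {
    printf "$1" | wc -c | tr -d ' '
}

ascii_bytes=$(byte_count 'a')              # U+0061, the letter a
latin_bytes=$(byte_count '\303\251')       # U+00E9, e with acute accent
euro_bytes=$(byte_count '\342\202\254')    # U+20AC, the euro sign

echo "a: $ascii_bytes byte"
echo "e-acute: $latin_bytes bytes"
echo "euro: $euro_bytes bytes"
```

In Latin-1 the same e with acute accent is the single byte 233, which is why a file read with the wrong encoding shows two odd characters instead of one accented letter.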
125What to remember about encoding
- Be aware of encoding issues.
- UTF-8 is more and more common: if you use it, no one will question it
- For some languages, Latin-1 encoding is OK too