Title: filters that handle multiple files
1Lecture 6
- filters that handle multiple files
- shell scripts
- NLP and Corpus resources word lists
2Filters that handle multiple files
- comm - compare 2 sorted files line by line
- sdiff - print differences between 2 files side-by-side
- cat - concatenates files
- paste - merge lines of files
- split - split a file into pieces
3comm compares 2 sorted files
- Input
- file1        file2
- Arnaud       Chomsky
- Descartes    Descartes
- Kant         Fodor
- Lancelot     Pinker
- comm file1 file2
- (col. 1: only in file1; col. 2: only in file2; col. 3: in both)
- Arnaud       Chomsky      Descartes
- Kant         Fodor
- Lancelot     Pinker
4comm options
- comm -12 outputs data that is common to f1 and f2 (subtract col. 1 and 2 to give only col. 3)
- Descartes
- comm -23 outputs data unique to f1 (subtract col. 2 and 3)
- Arnaud
- Kant
- Lancelot
- comm -13 outputs data unique to f2 (subtract col. 1 and 3)
- Chomsky
- Fodor
- Pinker
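The three option combinations can be tried on the sample files from the previous slide. This is a minimal sketch: the scratch directory and the variable names (work, common, only1, only2) are assumptions made for illustration.

```shell
# Recreate the two sample files in a scratch directory (hypothetical location).
work=$(mktemp -d)
printf '%s\n' Arnaud Descartes Kant Lancelot > "$work/file1"
printf '%s\n' Chomsky Descartes Fodor Pinker > "$work/file2"

# comm needs sorted input; both files are already in sorted order.
common=$(comm -12 "$work/file1" "$work/file2")   # drop col. 1 and 2: lines in both files
only1=$(comm -23 "$work/file1" "$work/file2")    # drop col. 2 and 3: lines only in file1
only2=$(comm -13 "$work/file1" "$work/file2")    # drop col. 1 and 3: lines only in file2

echo "common:        $common"
echo "only in file1:" $only1
echo "only in file2:" $only2
```

If an input file is not sorted, run it through sort first; comm compares the files line by line and assumes both are in order.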
5The sdiff command
- File1    File2
- This     This
- is       is
- a        a
- test     text
- file     file
- sdiff File1 File2
- This     This
- is       is
- a        a
- test   | text     ("|" marks lines that differ)
- file.    file.
6The sdiff command (2)
- File1    File2
- This     This
- is       is
- a        a
- test     file
- file
- sdiff File1 File2
- This     This
- is       is
- a        a
- test   <          ("<" marks a unique occurrence in File 1)
- file.    file.
7The sdiff command (3)
- File1    File2
- This     This
- is       is
- a        a
- file     test
-          file
- sdiff File1 File2
- This     This
- is       is
- a        a
-        > test     (">" marks a unique occurrence in File 2)
- file.    file.
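The three cases on the sdiff slides can be reproduced with short files. This is a sketch under assumptions: File3 is a hypothetical third file (File1 without the line "test"), and -w 40 only narrows the output so the two columns fit on screen.

```shell
work=$(mktemp -d)
printf '%s\n' This is a test file > "$work/File1"
printf '%s\n' This is a text file > "$work/File2"
printf '%s\n' This is a file      > "$work/File3"   # hypothetical: File1 minus "test"

# sdiff exits with status 1 when the files differ, so "|| true" keeps
# a script that runs under "set -e" from stopping here.
changed=$(sdiff -w 40 "$work/File1" "$work/File2" || true)   # test | text
removed=$(sdiff -w 40 "$work/File1" "$work/File3" || true)   # test <
added=$(sdiff -w 40 "$work/File3" "$work/File1" || true)     #      > test

printf '%s\n\n' "$changed" "$removed" "$added"
```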
8sdiff applied
- In                      In
- this                    this
- story                   story
- we                      we
- can                     can
- see                     see
- the/0                 | the
- trouble                 trouble
- in                      in
- the                     the
- Feeney's/Feeney         Feeney's/Feeney
- Family                | Family/family
- ,/.                     ,/.
- two/Two                 two/Two
- of                      of
- their/the_Feeney's    | their/the
- kids/children           kids/children
- ,                       ,
- Mary                    Mary
- and                     and
- Michael,                Michael,
- had                     had
9The cat command
- cat file1 >> file2
- append file1 to file2
- cat file1 file2 file3 file4 >> file5
- append files 1-4 to file5
- cat -n file1
- print file1 with line numbers
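A small run shows what each cat form does. The file contents and the scratch directory are assumptions for illustration only.

```shell
work=$(mktemp -d)
printf 'one\ntwo\n' > "$work/file1"
printf 'three\n'    > "$work/file2"

# Append file1 to the end of file2.
cat "$work/file1" >> "$work/file2"

# Concatenate both files onto file5; ">>" creates file5 if it does not exist yet.
cat "$work/file1" "$work/file2" >> "$work/file5"

# Print file2 with line numbers.
cat -n "$work/file2"
```

Using ">" instead of ">>" would overwrite the target file rather than append to it.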
10The paste command
- glues corresponding lines in different files together
- merged fields are separated by a TAB
- paste -d
- allows a separator other than TAB
- paste -s
- pastes together lines from the same file
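The three forms of paste can be compared on two short files. This sketch reuses names from the comm slide; the file names and variables are assumptions for illustration.

```shell
work=$(mktemp -d)
printf '%s\n' Arnaud Descartes  > "$work/file1"
printf '%s\n' Chomsky Descartes > "$work/file2"

tabbed=$(paste "$work/file1" "$work/file2")      # line 1 + TAB + line 1, and so on
comma=$(paste -d, "$work/file1" "$work/file2")   # same, with a comma as separator
serial=$(paste -s "$work/file1")                 # all lines of one file on a single line

printf '%s\n' "$tabbed" "$comma" "$serial"
```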
11The split command
- splits a file into pieces
- split -500 carol12.txt carol
- wc -l carol*
- 3836 carol12.txt
- 500 carolaa
- 500 carolab
- 500 carolac
- 500 carolad
- 500 carolae
- 500 carolaf
- 500 carolag
- 336 carolah
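The same pattern can be tried without the original text. In this sketch a generated 1200-line file stands in for carol12.txt (an assumption), and -l 500 is used as the current spelling of the older -500 form.

```shell
work=$(mktemp -d)
# Generate a 1200-line stand-in file; carol12.txt is not assumed to be present.
seq 1 1200 > "$work/sample.txt"

# 500 lines per piece; pieces are named with the prefix plus aa, ab, ac, ...
split -l 500 "$work/sample.txt" "$work/sample_"

# Line counts of the pieces: two full pieces and one shorter remainder.
wc -l "$work"/sample_*
```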
12The Unix Concept of a Shell
- The Unix OS has several command processors that stand between the user and the OS kernel as go-betweens. These go-betweens are called shells.
- The shells are large programs (written in C), so they have man pages
- man bash
13Assorted Shells
- There are 2 families of shells
- 1. The Bourne shell (includes sh and bash)
- we will use bash for most of our programming
- 2. The C shell (csh)
- we will use the C shell for part-of-speech tagging and parsing
14Login through the Shell
- [Diagram: at login a shell (/bin/sh or /bin/csh) sits between the user and the Linux system kernel and memory]
15Shell Capabilities
- File shorthand
- Input/Output Redirection
- Personalization of your environment including
ability to process programs you write. These
programs are called shell programs, or shell
scripts.
16File Shorthand
- Files
- chap1 chap3 chap5
- chap2 chap4 chap6
- Longhand
- chap1 chap2 chap3 . . .
- Shorthand
- chap*
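The shell expands the * wildcard to every matching file name before the command runs, so the shorthand and the longhand give the same result. A minimal sketch, using empty files created only for the demonstration (the extra file "notes" is an assumption, added to show that non-matching names are skipped):

```shell
work=$(mktemp -d)
touch "$work/chap1" "$work/chap2" "$work/chap3" \
      "$work/chap4" "$work/chap5" "$work/chap6" "$work/notes"

# Longhand: every file named explicitly.
longhand=$(cd "$work" && ls chap1 chap2 chap3 chap4 chap5 chap6)

# Shorthand: the shell replaces chap* with all names that start with "chap".
shorthand=$(cd "$work" && ls chap*)

echo "$shorthand"
```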
17Command Line Variables
18Input/Output Redirection
- Output
- ls > myfile          creates myfile
- date >> myfile       appends date to myfile
- ls | more            pipes output of ls into more
- Input
- tr \n < sonnets
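The four operators can be seen together in one short run. This is a sketch with assumed file names; echo and wc stand in for date and more so that the result is the same on every run.

```shell
work=$(mktemp -d)
mkdir "$work/docs"
touch "$work/docs/a.txt" "$work/docs/b.txt"

ls "$work/docs" > "$work/myfile"        # ">" creates (or overwrites) myfile
echo "end of list" >> "$work/myfile"    # ">>" appends to the end of myfile

# "|" feeds the output of one command into the next command.
count=$(ls "$work/docs" | wc -l)

# "<" makes a command read its input from a file.
upper=$(tr 'a-z' 'A-Z' < "$work/myfile")

echo "files listed: $count"
echo "$upper"
```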
19Shell Programming
- This new command-line-in-a-file is called a program or a script.
- A program or script is a set of instructions to the shell.
- The type of program that uses commands from a Linux shell (grep, tr, sort, etc.) is called a shell script.
20Advantages of Shell Scripting (1)
- A script can be re-used.
- You created your lexical growth output by writing command after command on the command line.
- If you store these commands in a script, you can use them again.
21#!/bin/sh
# lexy.sts
split babfc10.txt
# one word per line, sorted, duplicates removed
tr -cs 'A-z0-9' '\n' < xaa | sort -u > xaa.unq
tr -cs 'A-z0-9' '\n' < xab | sort -u > xab.unq
tr -cs 'A-z0-9' '\n' < xac | sort -u > xac.unq
tr -cs 'A-z0-9' '\n' < xad | sort -u > xad.unq
# count the words in the first piece, then the new words in each later piece
wc -w xaa.unq > stats.sts
comm -13 xaa.unq xab.unq | wc -w >> stats.sts
cat xaa.unq xab.unq | sort -u > prev
comm -13 prev xac.unq | wc -w >> stats.sts
cat prev xac.unq | sort -u >> prev1
comm -13 prev1 xad.unq | wc -w >> stats.sts
22Advantages of Shell Scripting (2)
- You can make the script apply to new files (a more powerful type of re-use) by using
- command line variables
- file shorthand
23#!/bin/sh
# lexy.sts
rm x*          # file shorthand
rm pre*
split $1       # command line variable
tr -cs 'A-z0-9' '\n' < xaa | sort -u > xaa.unq
tr -cs 'A-z0-9' '\n' < xab | sort -u > xab.unq
tr -cs 'A-z0-9' '\n' < xac | sort -u > xac.unq
tr -cs 'A-z0-9' '\n' < xad | sort -u > xad.unq
wc -w xaa.unq > stats.sts
comm -13 xaa.unq xab.unq | wc -w >> stats.sts
. . . . . . .
24Advantages of Shell Scripting (3)
- You can actually program in a script
- Instead of one command per file:
- tr -cs 'A-z0-9' '\n' < xaa | sort -u > xaa.unq
- tr -cs 'A-z0-9' '\n' < xab | sort -u > xab.unq
- tr -cs 'A-z0-9' '\n' < xac | sort -u > xac.unq
- tr -cs 'A-z0-9' '\n' < xad | sort -u > xad.unq
- write one loop over all the files:
- for file in x*
- do
-     tr -cs 'A-z0-9' '\n' < "$file" | sort -u > "$file.out"
- done
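The loop can be tried on two tiny stand-in pieces. This is a sketch under assumptions: the sample sentences are invented, the pattern x?? is used so that only the three-letter piece names match, and the set 'A-Za-z0-9' is used instead of the slides' 'A-z0-9' because in ASCII the range A-z also covers a few punctuation characters such as [ and _.

```shell
work=$(mktemp -d)
printf 'the cat saw the dog\n' > "$work/xaa"
printf 'a dog saw a cat\n'     > "$work/xab"

# One loop body replaces one hand-written command per file.
for file in "$work"/x??
do
    # Put each word on its own line, then keep one sorted copy of each word.
    tr -cs 'A-Za-z0-9' '\n' < "$file" | sort -u > "$file.out"
done

cat "$work/xaa.out"
```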
25Example of a Shell Script
26#!/bin/sh
# lexy.loop
rm x*
rm *.out
split $1
# one sorted list of unique words per piece
for file in x*
do
    tr -cs 'A-z0-9' '\n' < "$file" | sort -u > "$file.out"
done
echo xaa.out `wc -w < xaa.out` > stats.loop
mv xaa.out prev1
for file in *.out
do
    mv prev1 prev
    # wc -w inside the backquotes reads the output of comm from the pipe
    comm -13 prev "$file" | echo "$file" `wc -w` >> stats.loop
    cat "$file" >> prev
    sort -u prev >> prev1
done
27lexy.loop output
- fitzpatricke@baboon lex.growth$ more stats.loop
- xaa.out 1621
- xab.out 646
- xac.out 550
- xad.out 371
- xae.out 364
- xaf.out 279
- xag.out 341
- xah.out 326
- xai.out 77
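Each count after the first is the number of words in that piece that did not occur in any earlier piece. The counting step can be sketched on two tiny word lists; the words and the scratch directory are assumptions for illustration, and the file names follow the slides.

```shell
work=$(mktemp -d)
printf '%s\n' cat dog saw the > "$work/prev"      # sorted words seen so far
printf '%s\n' a cat dog saw   > "$work/xab.out"   # sorted words in the next piece

# comm -13 keeps only lines unique to the second file: the new words.
new=$(comm -13 "$work/prev" "$work/xab.out" | wc -w)
echo "xab.out" $new

# Merge the new piece into the running word list for the next round.
sort -u "$work/prev" "$work/xab.out" > "$work/prev1"
```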
28A shell script is a program
- It must be executable
- chmod u+x my_script
- It must be run with ./ so bash knows to find it in the current directory
- ./my_script sonnets
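Both steps can be tried end to end. In this sketch the body of my_script and the contents of sonnets are assumptions made for illustration; inside the script, $1 stands for the first word after the script name on the command line.

```shell
work=$(mktemp -d)

# Write a tiny script (hypothetical contents).
cat > "$work/my_script" <<'EOF'
#!/bin/sh
# Count the lines in the file named on the command line.
wc -l < "$1"
EOF

printf 'line one\nline two\n' > "$work/sonnets"

chmod u+x "$work/my_script"                    # make it executable for its owner
result=$(cd "$work" && ./my_script sonnets)    # ./ = look in the current directory
echo $result
```

Without the chmod step the shell refuses to run the file, and without ./ it searches only the directories listed in PATH.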
29NLP and Corpus resources word lists
- Word list applications
- Spell checking
- Genre identification
- Assessing difficulty of a text
- Sample Word Lists
- The General Service List
- The Academic Word List