filters that handle multiple files - PowerPoint PPT Presentation

Slides: 30
Provided by: ralphgr

Transcript and Presenter's Notes


1
Lecture 6
  • filters that handle multiple files
  • shell scripts
  • NLP and Corpus resources: word lists

2
Filters that handle multiple files
  • comm - compare 2 sorted files line by line
  • sdiff - print differences between 2 files side-by-side
  • cat - concatenates files
  • paste - merge lines of files
  • split - split a file into pieces

3
comm compares 2 sorted files
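  • Input:
  •   file1        file2
  •   Arnaud       Chomsky
  •   Descartes    Descartes
  •   Kant         Fodor
  •   Lancelot     Pinker
  • comm file1 file2
  • Output (shown grouped by column; the real output interleaves the lines in sorted order):
  •   only in file1    only in file2    in both
  •   Arnaud           Chomsky          Descartes
  •   Kant             Fodor
  •   Lancelot         Pinker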

4
comm options
  • comm -12 outputs data that is common to f1 and f2 (suppresses columns 1 and 2, leaving only column 3)
  •   Descartes
  • comm -23 outputs data unique to f1 (suppresses columns 2 and 3)
  •   Arnaud
  •   Kant
  •   Lancelot
  • comm -13 outputs data unique to f2 (suppresses columns 1 and 3)
  •   Chomsky
  •   Fodor
  •   Pinker
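
The three option combinations above can be tried with a minimal sketch like the following. It assumes a POSIX shell and writes the two example files into a scratch directory; the variable names (workdir, common, only_f1, only_f2) are only illustrative.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Recreate the two sorted input files from the slide.
printf '%s\n' Arnaud Descartes Kant Lancelot > file1
printf '%s\n' Chomsky Descartes Fodor Pinker > file2

common=$(comm -12 file1 file2)    # lines found in both files
only_f1=$(comm -23 file1 file2)   # lines found only in file1
only_f2=$(comm -13 file1 file2)   # lines found only in file2

echo "common to both:"; echo "$common"
echo "unique to file1:"; echo "$only_f1"
echo "unique to file2:"; echo "$only_f2"
```

comm expects both inputs to be sorted already; unsorted input gives unreliable results.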

5
The sdiff command
  • File1    File2
  • This     This
  • is       is
  • a        a
  • test     text
  • file     file
  • sdiff File1 File2
  • This     This
  • is       is
  • a        a
  • test  |  text    (| marks lines that differ)
  • file     file

6
The sdiff command (2)
  • File1    File2
  • This     This
  • is       is
  • a        a
  • test     file
  • file
  • sdiff File1 File2
  • This     This
  • is       is
  • a        a
  • test  <          (< marks a line that occurs only in File1)
  • file     file

7
The sdiff command (3)
  • File1    File2
  • This     This
  • is       is
  • a        a
  • file     test
  •          file
  • sdiff File1 File2
  • This     This
  • is       is
  • a        a
  •       >  test    (> marks a line that occurs only in File2)
  • file     file
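
A small sketch of the three sdiff markers, assuming the sdiff from GNU diffutils is installed. File3 is a hypothetical third file (File1 without the line "test"), introduced only so that one script can show all three cases.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

printf '%s\n' This is a test file > File1
printf '%s\n' This is a text file > File2
printf '%s\n' This is a file > File3

# sdiff exits with status 1 when the files differ; that is not an error here.
changed=$(sdiff File1 File2 || true)   # "test | text": the lines differ
removed=$(sdiff File1 File3 || true)   # "test <": line only in the left file
added=$(sdiff File3 File1 || true)     # "> test": line only in the right file

printf '%s\n\n' "$changed" "$removed" "$added"
```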

8
sdiff applied
  • In                     In
  • this                   this
  • story                  story
  • we                     we
  • can                    can
  • see                    see
  • the/0               |  the
  • trouble                trouble
  • in                     in
  • the                    the
  • Feeney's/Feeney        Feeney's/Feeney
  • Family              |  Family/family
  • ,/.                    ,/.
  • two/Two                two/Two
  • of                     of
  • their/the_Feeney's  |  their/the
  • kids/children          kids/children
  • ,                      ,
  • Mary                   Mary
  • and                    and
  • Michael,               Michael,
  • had                    had
9
The cat command
  • cat file1 >> file2
  •   append file1 to file2
  • cat file1 file2 file3 file4 >> file5
  •   append files 1-4 to file5
  • cat -n file1
  •   print file1 with line numbers
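
As a quick illustration of appending with cat, here is a sketch that uses two small made-up files in a scratch directory (the file contents are assumptions for the demo).

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

printf 'one\ntwo\n' > file1
printf 'three\n' > file2

cat file1 >> file2    # file2 now holds: three, one, two
cat -n file2          # print file2 with line numbers

line_count=$(wc -l < file2 | tr -d ' ')
echo "file2 has $line_count lines"
```

Note that >> appends, whereas a single > would have overwritten file2.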

10
The paste command
  • glues corresponding lines in different files
    together
  • merged fields are separated by a TAB
  • paste -d
  • allows a separator other than TAB
  • paste -s
  • pastes together lines from the same file
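
The paste options above can be sketched as follows. The files words and tags are hypothetical examples (a token list and a matching part-of-speech list), not files from the lecture.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

printf '%s\n' the cat sat > words
printf '%s\n' DET NOUN VERB > tags

tabbed=$(paste words tags)           # corresponding lines joined by a TAB
slashed=$(paste -d '/' words tags)   # same, with / as the separator
serial=$(paste -s -d ' ' words)      # all lines of one file joined into one line

printf '%s\n' "$tabbed" "$slashed" "$serial"
```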

11
The split command
  • splits a file into pieces
  • split -500 carol12.txt carol
  • wc -l carol*
  • 3836 carol12.txt
  • 500 carolaa
  • 500 carolab
  • 500 carolac
  • 500 carolad
  • 500 carolae
  • 500 carolaf
  • 500 carolag
  • 336 carolah
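
The same idea on a smaller scale, as a sketch: sample.txt and the prefix part are made-up names, and the input is a generated 1200-line file. The example uses split -l 500, the POSIX spelling of the older split -500 form shown on the slide.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Build a 1200-line stand-in for a text file.
awk 'BEGIN { for (i = 1; i <= 1200; i++) print i }' > sample.txt

split -l 500 sample.txt part    # creates partaa, partab, partac
wc -l part*                     # line count of each piece

pieces=$(ls part* | wc -l | tr -d ' ')
last=$(wc -l < partac | tr -d ' ')
echo "$pieces pieces, the last one has $last lines"
```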

12
The Unix Concept of a Shell
  • The Unix OS has several command processors that
    stand between the user and the OS kernel as
    go-betweens. These go-betweens are called
    shells.
  • The shells are large programs (written in C) so
    they have man pages
  • man bash

13
Assorted Shells
  • There are 2 families of shells
  • 1. The Bourne shell (includes sh and bash)
  • we will use bash for most of our
    programming
  • 2. The C shell (csh)
  • we will use the C shell for part of speech
    tagging and parsing

14
Login through the Shell
  • [Diagram: the user logs in through a shell (/bin/sh or /bin/csh), which sits between the user and the Linux system kernel and memory]

15
Shell Capabilities
  • File shorthand
  • Input/Output Redirection
  • Personalization of your environment including
    ability to process programs you write. These
    programs are called shell programs, or shell
    scripts.

16
File Shorthand
  • Files
  • chap1 chap3 chap5
  • chap2 chap4 chap6
  • Longhand
  • chap1 chap2 chap3 . . .
  • Shorthand
  • chap*
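
A short sketch of file shorthand in action, assuming empty placeholder files created only for the demo (the extra file notes is there to show that the pattern does not match it).

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

touch chap1 chap2 chap3 chap4 chap5 chap6 notes

longhand=$(ls chap1 chap2 chap3 chap4 chap5 chap6)
shorthand=$(ls chap*)    # the shell expands chap* before ls runs

echo chap*               # shows the expansion itself
```

The shell, not ls, does the expansion, so the same shorthand works with any command.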

17
Command Line Variables
  • split babfc10.txt
  •   $0      $1      ($0 is the command name, $1 is its first argument)
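
To see the command line variables from inside a script, a minimal sketch: show_args is a hypothetical script name, written to a scratch directory and run with sh.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Write a tiny script that reports its own command line variables.
cat > show_args <<'EOF'
#!/bin/sh
# $0 is the name the script was called by; $1 is its first argument.
echo "command: $0"
echo "first argument: $1"
EOF

output=$(sh show_args babfc10.txt)
printf '%s\n' "$output"
```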

18
Input/Output Redirection
  • Output
  • ls > myfile         creates myfile
  • date >> myfile      appends date to myfile
  • ls | more           pipes output of ls into more
  • Input
  • tr \n < sonnets     tr reads its input from sonnets
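
The four redirection forms can be tried together in a sketch like this one. The files a.txt and b.txt are placeholders, and the tr call (upper-casing) is an illustrative choice, not the command from the slide.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

touch a.txt b.txt

ls > myfile       # create myfile (it lists a.txt, b.txt and myfile itself)
date >> myfile    # append the current date to myfile
ls | wc -l        # pipe: the output of ls becomes the input of wc

upper=$(tr 'a-z' 'A-Z' < myfile)   # input redirection: tr reads from myfile
printf '%s\n' "$upper"
```

myfile shows up in its own listing because the shell creates it before ls starts.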

19
Shell Programming
  • This new command-line-in-a-file is called a
    program or a script.
  • A program or script is a set of instructions to
    the shell.
  • The type of program that uses commands from a
    Linux shell (grep, tr, sort, etc.) is called a
    shell script.

20
Advantages of Shell Scripting (1)
  • A script can be re-used.
  • You created your lexical growth output by writing
    command after command on the command line.
  • If you store these commands in a script, you can
    use them again.

21
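#!/bin/sh
# lexy.sts: count the new word types that each successive piece of the text adds
# Note: in ASCII the range A-z also covers the punctuation between Z and a.
split babfc10.txt
# one word per line, sorted, duplicates removed
tr -cs 'A-z0-9' '\n' < xaa | sort -u > xaa.unq
tr -cs 'A-z0-9' '\n' < xab | sort -u > xab.unq
tr -cs 'A-z0-9' '\n' < xac | sort -u > xac.unq
tr -cs 'A-z0-9' '\n' < xad | sort -u > xad.unq
# word types in the first piece
wc -w xaa.unq > stats.sts
# word types in each later piece that have not been seen before
comm -13 xaa.unq xab.unq | wc -w >> stats.sts
cat xaa.unq xab.unq | sort -u > prev
comm -13 prev xac.unq | wc -w >> stats.sts
cat prev xac.unq | sort -u >> prev1
comm -13 prev1 xad.unq | wc -w >> stats.sts
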
22
Advantages of Shell Scripting (2)
  • You can make the script apply to new files (a
    more powerful type of re-use) by using
  • command line variables
  • file shorthand

23
#!/bin/sh
# lexy.sts
rm x*          # file shorthand
rm pre*
split $1       # command line variable
tr -cs 'A-z0-9' '\n' < xaa | sort -u > xaa.unq
tr -cs 'A-z0-9' '\n' < xab | sort -u > xab.unq
tr -cs 'A-z0-9' '\n' < xac | sort -u > xac.unq
tr -cs 'A-z0-9' '\n' < xad | sort -u > xad.unq
wc -w xaa.unq > stats.sts
comm -13 xaa.unq xab.unq | wc -w >> stats.sts
. . . . . . .

24
Advantages of Shell Scripting (3): You can actually program in a script
  • Instead of repeating one line per file:
  • tr -cs 'A-z0-9' '\n' < xaa | sort -u > xaa.unq
  • tr -cs 'A-z0-9' '\n' < xab | sort -u > xab.unq
  • tr -cs 'A-z0-9' '\n' < xac | sort -u > xac.unq
  • tr -cs 'A-z0-9' '\n' < xad | sort -u > xad.unq
  • use a loop:
  • for file in x*
  • do
  •   tr -cs 'A-z0-9' '\n' < $file | sort -u > $file.out
  • done
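
A runnable sketch of the loop, using two tiny made-up input files in place of the real split pieces. As an assumption, it writes the character set as 'A-Za-z0-9' rather than the slide's 'A-z0-9', because in ASCII the range A-z also takes in the punctuation between Z and a.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

printf 'the cat sat on the mat\n' > xaa
printf 'a dog and a cat\n' > xab

for file in x*
do
    # One word per line, then sorted with duplicates removed.
    tr -cs 'A-Za-z0-9' '\n' < "$file" | sort -u > "$file.out"
done

wc -w xaa.out xab.out    # number of distinct words in each piece
```

The pattern x* is expanded once, before the loop starts, so the new .out files are not picked up as input.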

25
Example of a Shell Script
26
#!/bin/sh
# lexy.loop
rm x*
rm *.out
split $1
# one sorted list of distinct words per piece
for file in x*
do
  tr -cs 'A-z0-9' '\n' < $file | sort -u > $file.out
done
# word types in the first piece
echo xaa.out `wc -w < xaa.out` > stats.loop
mv xaa.out prev1
# for each later piece, count the word types not seen so far
for file in *.out
do
  mv prev1 prev
  comm -13 prev $file | echo $file `wc -w` >> stats.loop
  cat $file >> prev
  sort -u prev >> prev1
done

27
lexy.loop output
  • fitzpatricke@baboon lex.growth more stats.loop
  • xaa.out 1621
  • xab.out 646
  • xac.out 550
  • xad.out 371
  • xae.out 364
  • xaf.out 279
  • xag.out 341
  • xah.out 326
  • xai.out 77

28
A shell script is a program
  • It must be executable
  • chmod u+x my_script
  • It must be run with ./ so that the shell looks for it in the current directory
  • ./my_script sonnets
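
Both steps can be sketched end to end as follows. The body of my_script here is a one-line placeholder (an assumption for the demo), not the lecture's script.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# A placeholder script that just reports its first argument.
cat > my_script <<'EOF'
#!/bin/sh
echo "processing $1"
EOF

chmod u+x my_script             # give the owner execute permission
result=$(./my_script sonnets)   # ./ means: look in the current directory
echo "$result"
```

Without the ./ prefix the shell searches only the directories listed in PATH, which normally do not include the current directory.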

29
NLP and Corpus resources: word lists
  • Word list applications
  • Spell checking
  • Genre identification
  • Assessing difficulty of a text
  • Sample Word Lists
  • The General Service List
  • The Academic Word List