Text File Processing - PowerPoint PPT Presentation

1
Text File Processing
  • CSCI 2467 Spring 2007
  • Instructor Michael Ruth
  • Computer Science Department
  • University of New Orleans
  • mruth_at_cs.uno.edu

2
Topics
  • There are a number of commands for dealing with
    textual files
  • They fit into roughly four categories
  • Display all of it in some way
  • Display a specific part of it
  • Change the file in some way
  • Compare this file with other files

3
Displaying the file (text-only)
  • Remember these commands only have defined
    behavior when run on text files
  • Some basic commands
  • cat: catenate files
  • more/page: navigate a page or line at a time
  • man uses this, so it's good to know
  • less: not a joke, I swear
  • Similar to more except that less allows pattern
    searches

4
catenate
  • Dictionary.com says "to link together; form into
    a connected series"
  • cat simply prints the contents of the file or files
    you specify as arguments on its standard
    output.
  • The syntax for the cat command is
  • cat <file>...
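
As a minimal sketch of what catenating looks like in practice, the snippet below builds two small files and prints them with cat. The scratch directory and the file contents are made up for illustration.

```shell
# Work in a scratch directory so nothing in the current directory is touched.
workdir=$(mktemp -d)

# Two small, made-up text files.
printf 'alpha\nbeta\n' > "$workdir/file1"
printf 'gamma\n' > "$workdir/file2"

# cat writes the files to standard output, one after the other.
cat "$workdir/file1" "$workdir/file2"

# Capture the same output to show that it is one connected series of lines.
combined=$(cat "$workdir/file1" "$workdir/file2")

rm -rf "$workdir"
```

The order of the arguments is the order of the output, so swapping file1 and file2 would print gamma first.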

5
more/page
  • These allow you to read text files page by page
  • The syntax of the command is
  • more <file>...
  • If you use more than one file, it separates them
    with the file name surrounded by short lines of
    colons, like so
  • :::::::::::::: <filename> ::::::::::::::
  • It is possible to navigate through the file
  • q: quit
  • spacebar: go forward one page
  • return: go forward in the file one line
  • b: go back one page

6
less
  • Actually performs the exact same work as more
  • Except that pattern matching is possible
  • The syntax is less <file>...
  • To perform pattern matching, use
  • /<regular expression>
  • We'll get back to this later

7
Cat and multiple files
  • When using cat with multiple files it becomes a
    special kind of filter
  • Combines inputs into one output

[Diagram: file1, file2, file3 -> cat file1 file2 file3 -> standard out]

8
Most Unix Commands
  • Much earlier in our discussions of Unix, we said
    that pipes are a major design decision
  • Most Unix commands are filters

[Diagram: standard input -> command -> standard output]

9
Standard Streams
  • There are three standard streams
  • Input
  • Input into the program (from keyboard)
  • Output
  • Output from the program (to screen/terminal)
  • Error
  • Error messages from the program (to screen/terminal)
  • These are useless, right?

10
Unix and Redirection
  • You can redirect streams to files
  • This makes even more sense when we think of
    everything as being a file
  • We can redirect
  • Standard output
  • Standard input
  • Standard error
  • Pipe from one command directly to another

11
Redirection to a file
  • Redirecting standard output
  • <command line> > <filename>
  • Ex: cat file1 file2 > file3
  • Catenates file1 and file2 and saves that to file3
  • We can also append to a file
  • <command line> >> <filename>
  • Ex: cat file1 >> file2
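
The difference between overwriting and appending can be sketched as follows. The file contents are made up, and the append step targets file3 (rather than file2 as in the slide) so that both operators act on the same file.

```shell
workdir=$(mktemp -d)
printf 'one\n' > "$workdir/file1"
printf 'two\n' > "$workdir/file2"

# > creates file3 (or overwrites it) with the combined output.
cat "$workdir/file1" "$workdir/file2" > "$workdir/file3"

# >> adds to the end of file3 instead of overwriting it.
cat "$workdir/file1" >> "$workdir/file3"

result=$(cat "$workdir/file3")
echo "$result"

rm -rf "$workdir"
```

Running the > line a second time would throw away the appended line, because > always starts the target file from empty.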

12
Redirection from a file
  • Redirecting standard input
  • <command line> < <filename>
  • Ex: cat < file3
  • Produces exactly the same output as cat file3
  • NOTE
  • Generally, you can use file matching
    metacharacters if a file name is an option, but
    NOT if the file name is the source for input
    redirection
  • Redirection must happen from an explicitly
    specified file or device.
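
A small sketch of input redirection, assuming a made-up two-line file3. It also uses wc -l to make visible who opens the file: with a file name argument the command opens it, with < the shell does.

```shell
workdir=$(mktemp -d)
printf 'line 1\nline 2\n' > "$workdir/file3"

# cat opens file3 itself because the name is an argument.
from_argument=$(cat "$workdir/file3")

# Here the shell opens file3 and connects it to cat's standard input;
# cat receives no file name at all.
from_redirect=$(cat < "$workdir/file3")

# wc shows the difference: given a name it prints the name after the count,
# given redirected input it has no name to print.
wc -l "$workdir/file3"
wc -l < "$workdir/file3"

rm -rf "$workdir"
```

Both cat commands produce the same text; only the way the file reaches the program differs.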

13
Error Redirection
  • One does NOT normally do this, as errors are
    traditionally sent to the screen
  • Or in the case of daemons to log files and the
    screen
  • Redirecting standard error (Bourne shell and
    descendants)
  • <command line> 2> <filename>
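
To see that standard error really is a separate stream, the sketch below asks ls for a file that does not exist. The log file name errors.log and the missing file name are made up for illustration.

```shell
workdir=$(mktemp -d)

# ls writes its complaint about the missing file to standard error.
# 2> sends that message to errors.log; standard output is captured separately.
# ls exits non-zero here, so "|| true" keeps a strict shell from stopping.
listing=$(ls "$workdir/no-such-file" 2> "$workdir/errors.log") || true

errors=$(cat "$workdir/errors.log")
echo "standard output was: '$listing'"
echo "standard error was:  '$errors'"

rm -rf "$workdir"
```

Standard output stays empty while the log file receives the message, which is why > alone does not hide error messages.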

14
Piping (useful!)
  • We can redirect input and output at the same
    time
  • command < input > output
  • Useful, but there is another way
  • cat input | command > output
  • A pipe can hook an arbitrary number of programs
    together, each using the previous one's output as
    its input
  • Suppose we run prog1 on input1 and produce
    output1
  • Suppose we run prog2 on output1 and produce
    output2
  • Suppose we run prog3 on output2 and produce
    output3
  • We can simply run
  • prog1 < input1 | prog2 | prog3 > output3
  • An important tenet of the Unix philosophy has
    been to try to develop every program (at least
    every command line program) so that it is a
    filter rather than just a stand-alone program.
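
The three-program chain can be sketched with standard filters standing in for the hypothetical prog1, prog2 and prog3: here sort, uniq and wc -l, which together count the distinct lines of a made-up input file.

```shell
workdir=$(mktemp -d)
printf 'pear\napple\npear\nfig\napple\n' > "$workdir/input1"

# The long way: every stage writes an intermediate file.
sort < "$workdir/input1" > "$workdir/output1"
uniq < "$workdir/output1" > "$workdir/output2"
wc -l < "$workdir/output2" > "$workdir/output3_long"

# The pipeline: the same three stages with no intermediate files.
sort < "$workdir/input1" | uniq | wc -l > "$workdir/output3"

long_way=$(cat "$workdir/output3_long")
piped=$(cat "$workdir/output3")
echo "distinct lines: $piped"

rm -rf "$workdir"
```

Both routes give the same count; the pipeline simply skips output1 and output2.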

15
Viewing just a part of a file
  • tail
  • Displays the last part of the file
  • -n where n is a number (the number of lines
    shown)
  • head
  • Displays the first part of the file
  • -n where n is a number (the number of lines
    shown)
  • wc
  • -c Counts bytes
  • -w Counts words
  • -l Counts lines
  • -m Counts characters
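
A short sketch of head, tail and wc on a made-up five-line file. The counts are read from standard input so that wc prints only the number and not the file name.

```shell
workdir=$(mktemp -d)
# A five-line sample file.
printf 'one\ntwo\nthree\nfour\nfive\n' > "$workdir/numbers.txt"

first_two=$(head -n 2 "$workdir/numbers.txt")   # first 2 lines
last_two=$(tail -n 2 "$workdir/numbers.txt")    # last 2 lines

line_count=$(wc -l < "$workdir/numbers.txt")    # lines
word_count=$(wc -w < "$workdir/numbers.txt")    # words
byte_count=$(wc -c < "$workdir/numbers.txt")    # bytes, newlines included

echo "$first_two"
echo "$last_two"
echo "lines=$line_count words=$word_count bytes=$byte_count"

rm -rf "$workdir"
```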

16
Changing the file in some way
  • expand/unexpand
  • Expands the tabs into spaces or vice versa
  • unix2dos
  • UNIX to DOS text file format converter
  • dos2unix
  • DOS to Unix text file format converter
  • uniq
  • Filters out / reports the duplicate lines (that
    are adjacent) in a file
  • spell
  • Returns a list of incorrectly spelled words in a
    file
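
Two of these commands are easy to try on made-up input. The sketch below shows that uniq only collapses duplicates that sit next to each other, and that expand pads a tab out to the next tab stop (every 8 columns by default).

```shell
workdir=$(mktemp -d)
printf 'a\na\nb\na\n' > "$workdir/lines.txt"

# uniq collapses only adjacent duplicates, so the final "a" survives.
collapsed=$(uniq "$workdir/lines.txt")

# Sorting first makes all duplicates adjacent, so each line appears once.
distinct=$(sort "$workdir/lines.txt" | uniq)

# expand replaces the tab with spaces up to the next tab stop.
expanded=$(printf 'x\ty\n' | expand)

echo "$collapsed"
echo "$distinct"
echo "$expanded"

rm -rf "$workdir"
```

This is why uniq so often appears right after sort in a pipeline.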

17
Sorting a file
  • Sorting input files is often a necessary task
  • The syntax is
  • sort [options] [filenames]
  • Data can be sorted in a number of ways
  • -n: numerically instead of alphanumerically
  • -r: in descending instead of ascending order
  • Additionally, if multiple input files are given,
    the data from each file is merged during the sort
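
The effect of -n and -r shows up clearly on numbers of different lengths. This sketch uses made-up files, and pins the default sort to the C locale so the character-by-character ordering is predictable.

```shell
workdir=$(mktemp -d)
printf '10\n9\n100\n' > "$workdir/numbers.txt"
printf '50\n' > "$workdir/more.txt"

# Default: compares character by character, so "100" comes before "9".
default_order=$(LC_ALL=C sort "$workdir/numbers.txt")

# -n: compares by numeric value.
numeric_order=$(sort -n "$workdir/numbers.txt")

# -n -r: numeric, largest first.
reverse_numeric=$(sort -n -r "$workdir/numbers.txt")

# Several input files are merged into one sorted result.
merged=$(sort -n "$workdir/numbers.txt" "$workdir/more.txt")

echo "$default_order"
echo "$numeric_order"
echo "$reverse_numeric"
echo "$merged"

rm -rf "$workdir"
```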

18
split/csplit
  • Split a file into pieces
  • Syntax of the command is
  • 2 ways to do it
  • By number of lines
  • split -l n <filename> name
  • Where n is number of lines
  • By byte size
  • split -b n[k|m] <filename> name
  • Where n is number of bytes
  • Where k/m makes n a number of kilo-/megabytes
  • csplit allows the same, except that it allows for
    pattern-matching-based splitting; see the man
    page for more details
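
A minimal sketch of splitting by line count and then undoing the split with cat. The file name file4 and the prefix part_ are made up; the last argument to split is the prefix it uses when naming the pieces.

```shell
workdir=$(mktemp -d)
printf '1\n2\n3\n4\n5\n' > "$workdir/file4"

# Split into pieces of at most 2 lines each, named part_aa, part_ab, ...
split -l 2 "$workdir/file4" "$workdir/part_"
pieces=$(ls "$workdir" | grep -c '^part_')

# cat puts the pieces back together in name order.
cat "$workdir"/part_* > "$workdir/rejoined"

# cmp -s reports through its exit status whether the two files match.
if cmp -s "$workdir/file4" "$workdir/rejoined"; then
  rejoined=same
else
  rejoined=different
fi
echo "pieces=$pieces rejoined=$rejoined"

rm -rf "$workdir"
```

Five lines at two lines per piece gives three pieces, the last one holding a single line.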

19
split is the opposite of cat(combining)
[Diagram: file1, file2, file3 -> cat file1 file2 file3 -> standard out]
[Diagram: standard in -> split file4 -> file1, file2, file3]

20
Metacharacters
  • I know I promised this sooner, but here we are
  • Metacharacters are a group of characters that
    have special meanings to the UNIX operating
    system
  • Technically, we've seen quite a few already
  • <, >>, >, |
  • Now, we will see a few more
  • But not all; some are best left for later
  • ;
  • Used to separate multiple commands on the same
    line

21
Wildcards
  • *
  • matches any group of characters of any length,
    allowing a user to specify a large group of items
    with a short string.
  • ?
  • wild card character that matches any single
    character
  • [<chars>]
  • A set of characters that can be matched
  • Can be listed as a range, like so: [a-c]

22
Suppose
  • Suppose we wanted to match
  • All files that
  • Started with mike
  • Had an a or c just before the extension
  • Ended in a three-letter extension (such as wav,
    mp3, etc)
  • We could specify it as
  • ls mike*[ac].???
  • This might be an overcomplicated example, but it's
    straightforward (think about VINs)
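
The pattern can be tried out in a scratch directory. All file names below are made up: two are meant to match and three are meant to fail for different reasons.

```shell
workdir=$(mktemp -d)

# Run in a subshell so the caller's current directory is unchanged.
matches=$(
  cd "$workdir" || exit 1
  # mike1a.wav and mikec.mp3 should match.
  # mike1b.wav has the wrong letter before the dot, mikea.jpeg has a
  # four-character extension, and bobc.mp3 does not start with mike.
  touch mike1a.wav mikec.mp3 mike1b.wav mikea.jpeg bobc.mp3
  ls mike*[ac].???
)
echo "$matches"

rm -rf "$workdir"
```

Note that ? matches any single character, not only letters, so the pattern would also accept an extension such as .m4a or .001.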

23
File Comparisons
  • These are some of the more common file comparison
    tools built into Unix
  • Programmers use a variety of these fairly often
  • These commands are
  • cmp
  • comm
  • diff
  • diff3
  • dircmp

24
Directory Comparisons
  • Lists files in both directories and indicates
    whether the files in the directories are the same
    and/or different
  • Syntax
  • dircmp [options] directory1 directory2
  • Many useful options of interest worth checking
    out (using man of course)

25
comparison of any file types
  • Compares two files of any type and writes the
    results to the standard output
  • By default, cmp is silent if the files are the
    same; if they differ, the byte and line number at
    which the first difference occurred is reported.
  • cmp [options] file1 file2
  • The following is the most common option
  • -s: print nothing for differing files; return
    exit status only
  • The cmp utility exits with one of the following
    values
  • 0: The files are identical.
  • 1: The files are different.
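
Because -s reports only through the exit status, cmp fits naturally into an if statement. A minimal sketch with three made-up files:

```shell
workdir=$(mktemp -d)
printf 'same text\n' > "$workdir/a"
printf 'same text\n' > "$workdir/b"
printf 'other text\n' > "$workdir/c"

# With -s, cmp prints nothing; the exit status alone carries the answer.
if cmp -s "$workdir/a" "$workdir/b"; then a_vs_b=identical; else a_vs_b=different; fi
if cmp -s "$workdir/a" "$workdir/c"; then a_vs_c=identical; else a_vs_c=different; fi

echo "a vs b: $a_vs_b"
echo "a vs c: $a_vs_c"

rm -rf "$workdir"
```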

26
Compare sorted text files
  • Compares two sorted files line by line and
    displays the lines unique to each file and the
    lines they have in common.
  • The display is separated into 3 columns.
  • The first column displays what occurs only in the
    first
  • The second column displays what occurs only in
    the second
  • The third column displays what occurs in both
  • The syntax is
  • comm [options] filename1 filename2
  • There are three options of note
  • -1 suppress the first column
  • -2 suppress the second column
  • -3 suppress the third column
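
Combining the suppress options isolates a single column. The sketch below uses two made-up, already sorted files.

```shell
workdir=$(mktemp -d)
# comm expects both inputs to be sorted already.
printf 'apple\nbanana\ncherry\n' > "$workdir/first"
printf 'banana\ncherry\ndate\n' > "$workdir/second"

# Full three-column output.
comm "$workdir/first" "$workdir/second"

# Suppress two columns to keep just the remaining one.
only_first=$(comm -23 "$workdir/first" "$workdir/second")
only_second=$(comm -13 "$workdir/first" "$workdir/second")
in_both=$(comm -12 "$workdir/first" "$workdir/second")

echo "only in first:  $only_first"
echo "only in second: $only_second"
echo "in both:        $in_both"

rm -rf "$workdir"
```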

27
Difference files
  • The diff command compares text files.
  • It gives an index of all the lines that differ in
    the two files along with the line numbers.
  • It also displays what needs to be changed
  • Extremely important to programmers in a Unix
    environment
  • The output of this command can be used by the
    patch program (which we won't discuss)
  • The syntax is
  • diff [options] file1 file2
  • diff3 [options] file1 file2 file3
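
A minimal sketch of diff's default output on two made-up three-line files that differ only in their second line.

```shell
workdir=$(mktemp -d)
printf 'red\ngreen\nblue\n' > "$workdir/old"
printf 'red\nyellow\nblue\n' > "$workdir/new"

# diff exits with status 1 when the files differ, so "|| true" keeps a
# strict shell (set -e) from treating that as a failure.
changes=$(diff "$workdir/old" "$workdir/new") || true
echo "$changes"

rm -rf "$workdir"
```

The first line of the output, 2c2, reads as "line 2 of the first file must be changed to match line 2 of the second"; lines starting with < come from the first file and lines starting with > from the second.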

28
The future
  • Processes and Job Control
  • Additionally, a wrap-up of shell commands not
    already covered (which are important)
  • Shell Programming and Regular Expressions
  • Introduction to C Programming

29
Questions?