An introduction to techniques for text data analysis

Provided by: sddark
1
An introduction to techniques for text data
analysis
  • Josh Wortman
  • IEM Department
  • Technion

2
Why develop computer assisted methods for
analyzing text data?
  • To process large amounts of non-parametric data
    quickly
  • To avoid dealing with inter-observer agreement
    challenges
  • To avoid dealing with noise in self-reports
  • To allow replication of the analysis over time or
    on other data sets
  • To find patterns in language use that are not
    measured by current data methods
  • To gain new experimental insight and generate
    novel hypotheses

3
Steps to analyze text data
  • Check data completeness and labeling
  • Do some eyeball checks
  • Think about the trajectory of your analysis
  • Choose an analysis environment
    • Excel
      • you already know how to use it
      • there are some nice functions
      • overall limited tool set
    • Unix
      • more of a learning curve
      • requires programming logic
      • unlimited analytic techniques possible
  • Develop code, queries, or variables
  • Analyze results

4
What do we get with Excel?
  • Quick high level analyses
    • length
    • word count
  • Directed analyses when you are looking for
    something very specific
  • Relevant functions in Excel
    • Clean()
    • Trim()
    • Len()
    • Substitute()
    • Len() - Len(Substitute()), e.g. to count the
      spaces in a cell for a word count
5
What do we get with Unix?
  • Command line functions
  • Unlimited analysis control: more freedom to
    explore data
  • Regular Expressions
  • Awk and Perl text mining tool boxes
  • Summarization functionality
  • Build a collection of tools you use often

6
Getting the data into a unix environment
  • Google PuTTY, download putty.exe, put the icon on
    the desktop
  • Google WinSCP, download winscp.exe, install it
  • Export the data from Excel as tab-delimited text
    • note: save the file under a name without spaces
  • Open WinSCP, connect to your server account, copy
    the exported file there
  • Open putty.exe, log on to your server account

7
Getting familiar with unix
  • cd: move between directories
  • pwd: see where you are (print working directory)
  • ls: shows files in directory
    • use option -l to see more information
  • cat: print an entire file onto the screen
  • less: read and navigate through a file
    • use option -S to stop long lines from wrapping
  • wc: counts the lines, words, and bytes in a file
    • use option -l to count only the lines
  • grep: searches for text, works like Find in MS
    Word
    • e.g. >> grep "how are you?" file.txt
  • sort: sorts data in file
    • use option -n to sort in numerical order
    • use option -r to sort in reverse order
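
A minimal sketch of the commands on this slide, run on made-up data. The files file.txt and numbers.txt and their contents are assumptions for illustration only; the scratch directory keeps the example away from real files.

```shell
# Work in a scratch directory so nothing else is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical sample data: one message per line.
printf 'how are you?\nfine thanks\nhow are you?\nsee you later\n' > file.txt

pwd                               # show the current directory
ls -l                             # list files with size and date
wc -l < file.txt                  # count lines only
grep "how are you?" file.txt      # print the lines that contain the phrase
grep -c "how are you?" file.txt   # count the matching lines instead

# Hypothetical numeric data to compare sort orders.
printf '10\n9\n100\n' > numbers.txt
sort numbers.txt                  # text order: 10, 100, 9
sort -n numbers.txt               # numeric order: 9, 10, 100
sort -nr numbers.txt              # numeric order, reversed: 100, 10, 9
```

The quotes around the search phrase matter: without them grep would treat "are" and "you?" as file names.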

8
Getting familiar with unix
  • uniq: shows all the unique values that exist,
    requires data to be sorted first
    • use option -c to count how many of each
      value exist
  • head: prints only the first 10 lines of a file
    • use option -n followed by a number to
      print a different number of lines
      (e.g. head -n 20 file.txt)
  • >: this is called the redirect character, it
    tells Unix to write the output to a file
    • e.g. >> sort file.txt > file.sort.txt
  • |: this is called the pipe, it tells Unix to
    run the command before it and then use the
    result as the input for the command after it
    • e.g. >> sort file.txt | uniq
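
As an illustration of how redirect and pipe combine, here is a small sketch that builds a frequency table. The data in file.txt and the output name top2.txt are invented for the example.

```shell
# Work in a scratch directory with hypothetical data: one word per line.
workdir=$(mktemp -d)
cd "$workdir"
printf 'lol\nhello\nlol\nbye\nhello\nlol\n' > file.txt

sort file.txt > file.sort.txt   # redirect: write the sorted lines to a new file
sort file.txt | uniq            # pipe: the unique values (bye, hello, lol)
sort file.txt | uniq -c         # how many times each value occurs

# Chain several steps: count, rank by count (largest first),
# keep the top 2, and save the result.
sort file.txt | uniq -c | sort -nr | head -n 2 > top2.txt
cat top2.txt                    # lol (3 times), then hello (2 times)
```

Each command in the chain only sees the output of the one before it, which is why sort has to come before uniq.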

9
Getting familiar with unix
  • man: shows the manual for each Unix command
    • note: not always so easy to read
    • e.g. >> man sort
  • dos2unix / unix2dos
    • convert line endings between Windows (DOS)
      and Unix formats so exported files read cleanly
  • gawk / awk
    • your new friend. Use this easy scripting
      language to do analyses
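
To show what the line-ending conversion actually changes, here is a sketch on a made-up file, export.txt. Since dos2unix is not installed everywhere, the example uses tr as a stand-in that removes the carriage returns; that substitution is an assumption of this sketch, not part of the slides.

```shell
# Work in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical export with Windows line endings (carriage return + line feed).
printf 'id\ttext\r\n1\thello\r\n' > export.txt

# With dos2unix installed, this would convert the file in place:
#   dos2unix export.txt
# Stand-in that needs no extra install: delete every carriage return.
tr -d '\r' < export.txt > export.unix.txt

# Compare sizes in bytes: the converted file is 2 bytes smaller,
# one carriage return per line.
wc -c < export.txt        # 18
wc -c < export.unix.txt   # 16
```

Stray carriage returns are worth removing early, because they end up glued to the last field of every line and break comparisons later.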

10
Structure of gawk
  • gawk reads a file line by line and processes the
    contents however you specify
  • things gawk can do for you
  • compare values
  • count stuff up
  • reformat or change the data
  • print all or only certain parts
  • run complex logic functions
  • run math functions

11
Structure of gawk
  • Basic command structure
    • >> gawk '{do stuff}' filename.txt
  • You can also save the commands as a program, let's
    say myprogram.awk, containing
    • {do stuff}
  • Run the above gawk program with the following
    command
    • >> gawk -f myprogram.awk filename.txt
  • Note: gawk divides each line into fields every
    time it sees a space or tab. Awk can
    interpret every value as a number or as text.
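
The two ways of running a program can be sketched as follows. The example calls awk rather than gawk so that it runs on systems without GNU awk; the data and the one-line program are made up for illustration.

```shell
# Work in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical tab-delimited export: id, speaker, message.
printf '1\tanna\thello there\n2\tben\tLOL\n' > filename.txt

# Inline form: the program goes between single quotes, then the file name.
awk '{ print $2 }' filename.txt

# Program-file form: save the same program, then run it with -f.
printf '{ print $2 }\n' > myprogram.awk
awk -f myprogram.awk filename.txt

# By default fields split on any run of spaces or tabs, so the message
# "hello there" counts as two fields. -F '\t' splits on tabs only.
awk '{ print NF }' filename.txt           # 4, then 3
awk -F '\t' '{ print NF }' filename.txt   # 3, then 3
```

For tab-delimited exports where a column holds free text, splitting on tabs only is usually the safer choice.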

12
examples of commands and variables in gawk
  • ; use to separate command statements
  • $0 refers to whole input line
  • $1 refers to only first field
  • NF a system variable that contains the number of
    fields or strings per line
    • e.g. $NF refers to last field of line
  • print a command to print whatever follows it as
    a line on the screen
  • examples
    • print "hello" prints the word hello
    • print $0 prints the whole line
    • print NF prints the number of fields on
      the line
    • print $NF prints the value of the last
      field on the line
    • print "there are " NF " fields, the last field
      contains " $NF
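
The print examples can be tried on a single line of input. The sample sentence is invented, and awk is used in place of gawk as an assumption for portability.

```shell
# Hypothetical one-line input with four fields.
line='the quick brown fox'

printf '%s\n' "$line" | awk '{ print "hello" }'   # hello
printf '%s\n' "$line" | awk '{ print $0 }'        # the quick brown fox
printf '%s\n' "$line" | awk '{ print $1 }'        # the
printf '%s\n' "$line" | awk '{ print NF }'        # 4
printf '%s\n' "$line" | awk '{ print $NF }'       # fox

# Text in quotes and variables placed side by side are joined into one line.
printf '%s\n' "$line" |
  awk '{ print "there are " NF " fields, the last field contains " $NF }'

# Two statements separated by ; print two lines.
printf '%s\n' "$line" | awk '{ print $1; print $NF }'
```

Note the difference between NF (a count) and $NF (the contents of the field with that number).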

13
examples of commands and variables in gawk
  • $1=="hello" logic test measuring if first field
    equals hello
  • saveme=$1 assigns the value of first field to a
    new variable named saveme
  • $0~/LOL/ logic test to see if the line contains
    the string LOL
  • length($3) command gets length of a field,
    string, or whole line
  • if($1=="hello") {greeting++; print $0}
    • if true, 1) increment the variable greeting and
      2) print the line
  • for(i=1;i<=NF;i++) print $i
    • this for-loop will print each field as a new
      line
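
A short sketch of these tests and commands on a made-up chat log. The file chat.txt and its contents are assumptions, the END block that reports the counter is added for illustration, and awk stands in for gawk.

```shell
# Work in a scratch directory with a hypothetical chat log.
workdir=$(mktemp -d)
cd "$workdir"
printf 'hello world\nbye now\nhello LOL again\nthat was LOL\n' > chat.txt

# A test on its own prints every line for which it is true.
awk '$1 == "hello"' chat.txt   # lines 1 and 3
awk '$0 ~ /LOL/' chat.txt      # lines 3 and 4

# Length of the first field and of the whole line.
awk '{ print length($1), length($0) }' chat.txt

# Count greetings while printing them; END runs once after the last line.
awk '{ if ($1 == "hello") { greeting++; print $0 } }
     END { print greeting " greetings" }' chat.txt

# Print each field of line 3 on its own line.
awk 'NR == 3 { for (i = 1; i <= NF; i++) print $i }' chat.txt

# A variable keeps its last assigned value: here the first field of the last line.
awk '{ saveme = $1 } END { print saveme }' chat.txt   # that
```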

14
examples of commands and variables in gawk
  • toupper($0) converts all characters to upper
    case
  • tolower($0) converts all characters to lower
    case
  • gsub(/0/,"ZERO",$0)
    • replaces every 0 in the line with the word ZERO
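
These three functions can be checked on one sample line. The sentence is invented for the example, and awk is used instead of gawk as an assumption for portability.

```shell
# Hypothetical input line containing two zeros.
line='Room 101 has 0 windows'

printf '%s\n' "$line" | awk '{ print toupper($0) }'   # ROOM 101 HAS 0 WINDOWS
printf '%s\n' "$line" | awk '{ print tolower($0) }'   # room 101 has 0 windows

# gsub changes the line in place, so print it afterwards.
printf '%s\n' "$line" | awk '{ gsub(/0/, "ZERO", $0); print $0 }'
# Room 1ZERO1 has ZERO windows

# gsub also returns how many replacements it made.
printf '%s\n' "$line" | awk '{ n = gsub(/0/, "ZERO", $0); print n }'   # 2
```

Because gsub replaces every 0, the zero inside 101 is changed too; a stricter pattern would be needed to touch only stand-alone zeros.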

15
Regular Expressions
  • These are the key to all complex text analyses
  • These are the underlying tools in all text based
    AI
  • Usage requires creative and logical thinking
  • Use them to clean and standardize your text data
  • There is much to learn and master
  • For help, use Google or you can ask me
  • In gawk, a regex is contained between slashes
    /regex/

16
Examples of basic regex test conditions
  • $0~/tExT/
    • looks for exactly tExT to occur in the line
  • $0~/t[eE]xT/
    • looks for either texT or tExT
  • $0~/[0-9]/
    • looks for any digit 0-9
  • $0~/[1-346g-i]/
    • looks for any of the following: 1,2,3,4,6,g,h,i
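
The four test conditions can be tried on a few made-up lines. The files lines.txt and codes.txt and their contents are assumptions chosen so that each pattern matches some lines and skips others; awk stands in for gawk.

```shell
# Work in a scratch directory with hypothetical data.
workdir=$(mktemp -d)
cd "$workdir"
printf 'plain text\nsome tExT here\nsome texT here\nroom 42\n' > lines.txt

awk '$0 ~ /tExT/' lines.txt      # only "some tExT here"
awk '$0 ~ /t[eE]xT/' lines.txt   # "some tExT here" and "some texT here"
awk '$0 ~ /[0-9]/' lines.txt     # only "room 42"

# Character class with ranges: 1-3, 4, 6, and g-i.
printf 'a5\nb6\nhat\nzoo\n' > codes.txt
awk '$0 ~ /[1-346g-i]/' codes.txt   # "b6" (has 6) and "hat" (has h)
```

Matching is case-sensitive, which is why "plain text" is not found by either of the first two patterns.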

17
Some advanced regex functionality
  • Some characters have special meaning in regex
    (e.g. /)
  • To match a special character literally, you must
    precede it with something called an escape
    character. The escape character is \.
    • $0~/\// looks for / in the line of text
    • $0~/\\/ looks for \ in the line of text
  • Special meaning characters include
    • . ? * + ^ $ | [ ] ( ) { } and - inside brackets
  • Other control characters include
    • \t tab
    • \n new line
    • \NNN refers to the character whose ASCII code
      is NNN, written as an octal number
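
A sketch of escaping in practice, using a made-up file special.txt. The escaped dot is an extra example not on the slide, and awk is used in place of gawk as an assumption for portability.

```shell
# Work in a scratch directory with hypothetical data.
workdir=$(mktemp -d)
cd "$workdir"
# Five lines: a/b, c\d, e.f, xyz, and two columns separated by a tab.
printf 'a/b\nc\\d\ne.f\nxyz\ncol1\tcol2\n' > special.txt

awk '$0 ~ /\//' special.txt   # the line containing a slash: a/b
awk '$0 ~ /\\/' special.txt   # the line containing a backslash: c\d

# Unescaped, a dot matches any character, so every line matches.
awk '$0 ~ /./' special.txt
# Escaped, it matches only a literal dot: e.f
awk '$0 ~ /\./' special.txt

awk '$0 ~ /\t/' special.txt   # the line containing a tab

# \NNN with an octal code: 101 in octal is the letter A.
awk 'BEGIN { print "\101" }'
```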