Title: An introduction to techniques for text data analysis
1. An introduction to techniques for text data analysis
- Josh Wortman
- IEM Department
- Technion
2. Why develop computer-assisted methods for analyzing text data?
- To process large amounts of non-parametric data quickly
- To avoid dealing with inter-observer agreement challenges
- To avoid dealing with noise in self-reports
- To allow replication of the analysis over time or on other data sets
- To find patterns in language use that are not measured by current data methods
- To gain new experimental insight and generate more novel hypotheses
3. Steps to analyze text data
- Check data completeness and labeling
- Do some eyeball checks
- Think about the trajectory of your analysis
- Choose an analysis environment
  - Excel
    - you already know how to use it
    - there are some nice functions
    - overall limited tool set
  - Unix
    - more of a learning curve
    - requires programming logic
    - unlimited analytic techniques possible
- Develop code, queries, or variables
- Analyze results
4. What do we get with Excel?
- Quick high-level analyses
  - length
  - word count
- Directed analyses when you are looking for something very specific
- Relevant functions in Excel
  - Clean()
  - Trim()
  - Len()
  - Substitute()
  - Len() - Len(Substitute()): counts how often a character occurs (e.g. spaces, for a word count)
5. What do we get with Unix?
- Command line functions
- Unlimited analysis control, more freedom to explore data
- Regular expressions
- Awk and Perl text mining tool boxes
- Summarization functionality
- Build a collection of tools you use often
6. Getting the data into a Unix environment
- Google PuTTY, download putty.exe, put the icon on your desktop
- Google WinSCP, download winscp.exe, install it
- Export the data from Excel as tab-delimited text
  - note: save the filename without spaces
- Open WinSCP, connect to your server account, copy the exported file there
- Open putty.exe, log on to your server account
7. Getting familiar with Unix
- cd: move between directories
- pwd: see where you are (print working directory)
- ls: shows the files in a directory
  - use option -l to see more information
- cat: prints an entire file onto the screen
- less: read and navigate through a file
  - use option -S to stop long lines from wrapping, which displays tabular data more cleanly
- wc: counts the lines, words, and bytes in a file
  - use option -l to get only the number of lines
- grep: searches for text, works like Find in MS Word
  - e.g. >> grep "how are you?" file.txt
- sort: sorts the data in a file
  - use option -n to sort in numerical order
  - use option -r to sort in reverse order
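The commands above can be tried out on a small sample file. The following is only a sketch: the file name file.txt and its contents are made up for illustration, and the results are stored in shell variables so they can be inspected afterwards.

```shell
# Create a small tab-delimited sample file (hypothetical data: id, message).
printf '3\thow are you?\n1\thello there\n2\thow are you?\n10\tLOL that is funny\n' > file.txt

# wc -l reports the number of lines in the file.
line_count=$(wc -l < file.txt)

# grep finds lines containing the search text; option -c counts them.
match_count=$(grep -c "how are you?" file.txt)

# sort -n orders the lines numerically; adding -r reverses the order.
sorted=$(sort -n file.txt)
reversed=$(sort -n -r file.txt)

printf '%s lines, %s matches\n' "$line_count" "$match_count"
printf '%s\n' "$sorted"
```

Without -n, sort compares the lines as text, so the line starting with 10 would be placed before the line starting with 2.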
8. Getting familiar with Unix
- uniq: shows all the unique values that exist, requires the data to be sorted first
  - use option -c to count how many of each value exist
- head: prints only the first 10 lines of a file
  - use option -n with a number, e.g. head -n 20, to print a different number of lines
- >: this is called the redirect character, it tells Unix to write the output to a file
  - e.g. >> sort file.txt > file.sort.txt
- |: this is called the pipe, it tells Unix to run the command before it and then use the result as the input for the command after it
  - e.g. >> sort file.txt | uniq
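Combining sort, uniq, the pipe, and the redirect gives a simple frequency table. This is a minimal sketch; the file words.txt and its contents are invented for the example.

```shell
# Hypothetical data: one word per line.
printf 'lol\nhello\nlol\nbye\nhello\nlol\n' > words.txt

# Redirect: write the sorted words to a new file instead of the screen.
sort words.txt > words.sort.txt

# Pipe: sort the words, count each unique value, then sort the counts
# numerically in reverse order so the most frequent word comes first.
sort words.txt | uniq -c | sort -n -r > words.counts.txt

# head -n 1 prints only the first line, here the most frequent word.
head -n 1 words.counts.txt
```

The first sort is needed because uniq only merges identical lines that sit next to each other.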
9. Getting familiar with Unix
- man: shows the manual for each Unix command
  - note: not always so easy to read
  - e.g. >> man sort
- dos2unix / unix2dos
  - converts line endings between the Windows (DOS) and Unix formats to improve readability
- gawk / awk
  - your new friend. Use this easy scripting language to do analyses
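To see what the line-ending conversion changes, the sketch below builds a tiny file with Windows-style line endings and strips the extra carriage returns. It is an illustration only: the file names are made up, and tr is used as a stand-in so the example also runs where dos2unix is not installed.

```shell
# Hypothetical file with Windows-style line endings (carriage return + newline).
printf 'hello\r\nworld\r\n' > windows.txt

# Running "dos2unix windows.txt" would convert the file in place.
# As a stand-in, tr -d deletes every carriage return and writes a new file.
tr -d '\r' < windows.txt > unix.txt

# Each line lost one byte, so the converted file is two bytes smaller.
bytes_before=$(wc -c < windows.txt)
bytes_after=$(wc -c < unix.txt)
printf 'before: %s bytes, after: %s bytes\n' "$bytes_before" "$bytes_after"
```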
10. Structure of gawk
- gawk reads a file line by line and processes the contents however you specify
- things gawk can do for you
  - compare values
  - count stuff up
  - reformat or change the data
  - print all or only certain parts
  - run complex logic functions
  - run math functions
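A few of the capabilities listed above, sketched on invented data. The file scores.txt is hypothetical, and the examples call awk so they also run where gawk is not installed; gawk accepts the same programs. The END block, which runs once after the last line has been read, is standard awk.

```shell
# Hypothetical data: a name and a score on each line.
printf 'ann 12\nbob 6\ncat 30\n' > scores.txt

# Compare values and print only certain parts:
# show the name (field 1) when the score (field 2) is above 10.
high=$(awk '$2 > 10 { print $1 }' scores.txt)

# Count stuff up and run math: total the scores, then print the average
# once every line has been read.
average=$(awk '{ total += $2; n++ } END { print total / n }' scores.txt)

printf 'high scorers:\n%s\naverage: %s\n' "$high" "$average"
```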
11. Structure of gawk
- Basic command structure
  - >> gawk '{ do stuff }' filename.txt
- You can run gawk as a program, let's say myprogram.awk
  - {
  - do stuff
  - }
- Run the above gawk program with the following command
  - >> gawk -f myprogram.awk filename.txt
- Note: gawk divides the data up into fields each time it sees a space or tab. Awk can interpret every value as a number or as text.
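The two ways of running a program can be compared side by side. In this sketch the contents of filename.txt and myprogram.awk are invented, "do stuff" is replaced by printing the first field, and awk is called so the example also runs where gawk is not installed.

```shell
# Hypothetical input file with space-separated fields.
printf 'hello world\nhow are you today\n' > filename.txt

# Inline: the program is written between single quotes on the command line.
inline=$(awk '{ print $1 }' filename.txt)

# As a program file: save the same action in myprogram.awk ...
cat > myprogram.awk <<'EOF'
{
  print $1
}
EOF

# ... and run it with option -f, followed by the data file.
from_file=$(awk -f myprogram.awk filename.txt)

printf '%s\n' "$from_file"
```

Both forms give the same result; a program file is easier to keep and reuse once the logic grows beyond one line.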
12. Examples of commands and variables in gawk
- ; separates command statements
- $0 refers to the whole input line
- $1 refers to only the first field
- NF: a system variable that contains the number of fields or strings per line
  - e.g. $NF refers to the last field of the line
- print: a command to print whatever follows it as a line on the screen
- examples
  - print "hello" prints the word hello
  - print $0 prints the whole line
  - print NF prints the number of fields on the line
  - print $NF prints the value of the last field on the line
  - print "there are " NF " fields, the last field contains " $NF
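The print examples above, run on a made-up two-line file (lines.txt is hypothetical; awk is used so the sketch also runs where gawk is not installed).

```shell
# Hypothetical input: two lines with different numbers of fields.
printf 'the quick brown fox\nhello world\n' > lines.txt

# Text in quotes is printed as is; NF and $NF are filled in for each line.
summary=$(awk '{ print "there are " NF " fields, the last field contains " $NF }' lines.txt)

# Two statements separated by ; print the first field, then the field count.
first_and_count=$(awk '{ print $1; print NF }' lines.txt)

printf '%s\n' "$summary"
```

Note the difference between NF (the number of fields) and $NF (the contents of the last field).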
13. Examples of commands and variables in gawk
- $1=="hello" logic test measuring if the first field equals hello
- saveme=$1 assigns the value of the first field to a new variable named saveme
- $0~/LOL/ logic test to see if the line contains the string LOL
- length($3) command gets the length of a field, string, or whole line
- if($1=="hello") {greeting++; print $0}
  - if true, 1) increment the variable greeting and 2) print the line
- for(i=1;i<=NF;i++) print $i
  - this for-loop will print each field as a new line
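A sketch of the tests, the counter, and the for-loop on invented chat data. The file chat.txt is hypothetical, awk is used so the example also runs where gawk is not installed, and the END block (standard awk) prints the counter after the last line.

```shell
# Hypothetical chat messages, one per line.
printf 'hello my friend\nthat was funny LOL\nhello again\n' > chat.txt

# Print every line whose first field equals hello, counting them as we go.
greet_lines=$(awk '{ if ($1 == "hello") { greeting++; print $0 } }' chat.txt)

# The same counter, reported once at the end instead of printing the lines.
greet_count=$(awk '$1 == "hello" { greeting++ } END { print greeting }' chat.txt)

# For lines containing LOL, print the length of the whole line.
lol_length=$(awk '$0 ~ /LOL/ { print length($0) }' chat.txt)

# For lines containing LOL, print each field on its own line.
lol_fields=$(awk '$0 ~ /LOL/ { for (i = 1; i <= NF; i++) print $i }' chat.txt)

printf '%s greetings\n' "$greet_count"
```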
14. Examples of commands and variables in gawk
- toupper($0) converts all characters to upper case
- tolower($0) converts all characters to lower case
- gsub(/0/,"ZERO",$0)
  - replaces every 0 in the line with the word ZERO
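These functions can be tried directly on a line of text piped into awk. The input strings below are made up; awk is used so the sketch also runs where gawk is not installed.

```shell
# toupper and tolower return a converted copy of the text.
upper=$(printf 'Hello World\n' | awk '{ print toupper($0) }')
lower=$(printf 'Hello World\n' | awk '{ print tolower($0) }')

# gsub changes the line in place, replacing every match it finds.
zero=$(printf 'room 101 has 0 chairs\n' | awk '{ gsub(/0/, "ZERO", $0); print $0 }')

printf '%s\n%s\n%s\n' "$upper" "$lower" "$zero"
```

Because gsub replaces every match, the 0 inside 101 is replaced as well; a stricter pattern is needed if only a standalone 0 should change.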
15. Regular Expressions
- These are the key to all complex text analyses
- These are the underlying tools in all text-based AI
- Usage requires creative and logical thinking
- Use them to clean and standardize your text data
- There is much to learn and master
- For help, use Google or you can ask me
- In gawk, a regex is contained between slashes: /regex/
16. Examples of basic regex test conditions
- $0~/tExT/
  - looks for exactly tExT to occur in the line
- $0~/t[eE]xT/
  - looks for either texT or tExT
- $0~/[0-9]/
  - looks for any digit 0-9
- $0~/[1-346g-i]/
  - looks for any of the following: 1,2,3,4,6,g,h,i
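The test conditions above can be checked against a small made-up file. In this sketch regex.txt is hypothetical and awk is used so the example also runs where gawk is not installed. A test with no action after it prints every line that matches.

```shell
# Hypothetical input, one short entry per line.
printf 'tExT\ntexT\ntext\nroom 5\nghost\nroom 6\n' > regex.txt

# Exact, case-sensitive match.
exact=$(awk '$0 ~ /tExT/' regex.txt)

# Square brackets list the characters allowed at one position.
either=$(awk '$0 ~ /t[eE]xT/' regex.txt)

# A range inside brackets: any digit.
digits=$(awk '$0 ~ /[0-9]/' regex.txt)

# Ranges and single characters mixed: 1-3, 4, 6, and g-i.
mixed=$(awk '$0 ~ /[1-346g-i]/' regex.txt)

printf '%s\n' "$mixed"
```

The last test skips "room 5" because 5 is not in the set, while "ghost" matches on g and h.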
17. Some advanced regex functionality
- Some characters have special meaning in regex (e.g. /)
- To match a special-meaning character literally, you must precede it with something called an escape character. The escape character is this: \
  - $0~/\// looks for / in the line of text
  - $0~/\\/ looks for \ in the line of text
- Special-meaning characters include
  - . ? - ( )
- Other control characters include
  - \t tab
  - \n new line
  - \NNN refers to the character whose ASCII value is NNN, where NNN is replaced by an octal number
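A sketch of escaped characters in use. The file special.txt is made up, and awk is used so the example also runs where gawk is not installed. In awk the digits after the backslash are read as an octal number, so \101 stands for the letter A (ASCII 65).

```shell
# Hypothetical input containing a slash, a backslash, a tab, and a capital A.
printf 'a/b\nc\\d\nplain\ncol1\tcol2\nApple\n' > special.txt

# An escaped slash matches a literal / instead of ending the regex.
slash=$(awk '$0 ~ /\//' special.txt)

# Two backslashes match one literal backslash.
backslash=$(awk '$0 ~ /\\/' special.txt)

# \t matches a tab character.
tab=$(awk '$0 ~ /\t/' special.txt)

# \101 is the octal ASCII code for A.
octal=$(awk '$0 ~ /\101/' special.txt)

printf '%s\n%s\n%s\n%s\n' "$slash" "$backslash" "$tab" "$octal"
```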