Text Analysis Tools Development Introduction - PowerPoint PPT Presentation

About This Presentation
Title:

Text Analysis Tools Development Introduction

Description:

Computer literary analysis is currently concerned with the machine entry and markup of texts ... app: Make a citation file (a concordance) for each word (D) 17 ...

Slides: 22
Provided by: chssMon
Transcript and Presenter's Notes


1
Text Analysis Tools Development: Introduction
  • go to
    http://www.chss.montclair.edu/linguistics/lingpage/faculty/fitz/textools/texsyll05.htm
  • under Approximate Course Schedule
  • click on 9/13 Lab

2
Course Details
  • the syllabus
  • the text(s)
  • the final project

3
Course Purpose
  • To teach you how to use the text processing tools
    of Unix, the programming language Perl and Web
    facilities to
      • locate words in text
      • compare texts
      • get part of speech information about a text
      • build a dictionary
      • do morphological analysis
      • etc.

4
Tonight's Lecture
  • gives an overview of applications in language and
    linguistics computing
  • gives an overview and some practice with online
    software
  • gives an introduction to off-the-shelf software

5
Some Fields that Use Computer Analysis of
Language
  • Literary Analysis
  • Linguistic Analysis
      • theoretical
      • applied
  • Lexicography
  • Language Teaching
  • Translation
  • Publishing/Language Arts Teaching

6
Text vs. Corpus

7
A Text
  • A text is written language that illustrates a
    particular genre or variety of language usage.
  • Literary computing usually examines a specific
    text for lexical patterns to build a semantic
    model of the text. The text analyzed is usually
    small (less than 100,000 words) and can be
    tagged, edited, etc. by hand. See (1) on
    the worksheet.
  • Forensic linguistics is another area that
    analyses text. See (2) on the worksheet.

8
Forensics: Author Identification
  • vocabulary richness
      • type/token ratio
      • hapax legomena ratio
  • sentence length and word length
  • word choice
  • semantic categories
  • spelling errors
  • prescriptive grammar errors
  • sentential complexity
  • punctuation
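
Several of the measures listed above can be computed directly from a text. The following is a minimal sketch in Python (the course itself uses Unix tools and Perl; Python is used here only for illustration). The function names and the simple tokenizer are assumptions, and the hapax legomena ratio is taken here as hapaxes divided by tokens, although other definitions are also in use.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def vocabulary_richness(text):
    """Return basic vocabulary-richness measures for a text."""
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("text contains no words")
    counts = Counter(tokens)
    # Hapax legomena are words that occur exactly once in the text.
    hapaxes = [word for word, n in counts.items() if n == 1]
    return {
        "tokens": len(tokens),
        "types": len(counts),
        "type_token_ratio": len(counts) / len(tokens),
        # Assumption: ratio of hapaxes to tokens (some authors divide by types).
        "hapax_ratio": len(hapaxes) / len(tokens),
        "mean_word_length": sum(len(t) for t in tokens) / len(tokens),
    }


def mean_sentence_length(text):
    """Return the mean number of words per sentence, splitting on . ! ?"""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        raise ValueError("text contains no sentences")
    return len(tokenize(text)) / len(sentences)


sample = "The cat sat on the mat. The dog sat too."
stats = vocabulary_richness(sample)
print(stats)
print(mean_sentence_length(sample))
```

Splitting sentences on punctuation is crude (abbreviations such as "etc." will be miscounted), so a real author-identification study would need a more careful sentence splitter.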

9
A Corpus
  • Many fields use corpora for language modeling.
  • A corpus is a collection of texts and/or speech
    that is a representative sample of that portion
    of the language that is to be modeled. See (3)
    on the worksheet.

10
Literary vs. Linguistic Analysis
  • Linguistic computing usually examines a corpus
    for linguistic patterns to build a model of some
    facet of the language. The corpus analyzed is
    usually more than 1,000,000 words and must be
    manipulated by computer programs, or
    tools.

11
Sample Computer Applications in Literary and
Linguistic Computing

12
Literary Analysis
  • Computer literary analysis is currently concerned
    with the machine entry and markup of texts.
  • Some interesting work has been done despite the
    paucity of annotated texts online. See
    (4), (5), (6) on the worksheet.

13
Linguistic Analysis
  • theoretical - since theoretical linguistics
    studies competence, computer tools are not
    normally used. There is some use of tools to
    define properties of the lexicon (e.g., Levin).
  • applied - applied linguistics studies language
    performance. It depends on large amounts of data
    to determine patterns of language behavior. To
    search large amounts of data for patterns, we
    need computer tools.

14
Computational Applications in Applied Linguistics

15
Lexicography
  • Writing of mainstream dictionaries
  • Find new meanings of words (A)
  • Writing of specialized dictionaries for
      • language learners
      • specialized fields
      • lesser known languages
      • children

16
Tasks in Writing Specialized Dictionaries (1)
  • Determine the main entries for the dictionary
      • app: generate a word frequency list from the
        specialized texts (B)
      • app: compare this list to a representative corpus
        to obtain the words that belong to the
        specialized language (C)
  • Write the definition(s) for each entry
      • app: Make a citation file (a concordance) for
        each word (D)
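
The first two applications can be sketched as follows, in Python for illustration only (the course itself uses Unix tools and Perl). The function names, the `min_ratio` threshold, and the add-one smoothing are assumptions made for this sketch, not the method used on the worksheet. A word counts as characteristic of the specialized texts when its relative frequency there is at least `min_ratio` times its relative frequency in the reference corpus.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def frequency_list(text):
    """Return (word, count) pairs, most frequent first; ties are alphabetical."""
    counts = Counter(tokenize(text))
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))


def characteristic_words(special_text, reference_text, min_ratio=2.0):
    """Return (word, ratio) pairs for words over-represented in special_text."""
    special = Counter(tokenize(special_text))
    reference = Counter(tokenize(reference_text))
    n_special = sum(special.values())
    n_reference = sum(reference.values())
    result = []
    for word, count in special.items():
        rel_special = count / n_special
        # Add-one smoothing (an assumption) so that words missing from the
        # reference corpus do not cause a division by zero.
        rel_reference = (reference[word] + 1) / (n_reference + 1)
        ratio = rel_special / rel_reference
        if ratio >= min_ratio:
            result.append((word, round(ratio, 2)))
    return sorted(result, key=lambda pair: (-pair[1], pair[0]))


special = "the phoneme and the morpheme are units the phoneme is smaller"
reference = "the cat and the dog are friends the cat is smaller than the dog"
print(frequency_list(special)[:3])
print(characteristic_words(special, reference, min_ratio=1.2))
```

With texts this small the ratios are not meaningful; the comparison only becomes useful when the reference corpus is large enough to be representative.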

17
Words that characterize a document
[Figure; labels: frequency of words, words that characterize a document, words by rank order]

18
Tasks in Writing Specialized Dictionaries (2)
  • Write notes on usage of the entry.
      • app: analyze the concordance of the entry (E)
  • Make sure your defining vocabulary is not more
    difficult than the words you are trying to define.
      • app: check your defining vocabulary against a
        standard defining vocabulary (similar to (C)
        above)
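
Both applications on this slide are simple to prototype. The sketch below, in Python for illustration only (the course itself uses Unix tools and Perl), builds a keyword-in-context concordance and lists the words of a definition that fall outside a defining vocabulary. The function names, the bracketed output format, and the tiny vocabulary in the example are all hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and return its word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def concordance(text, keyword, width=3):
    """Return one keyword-in-context line per hit, with `width` words on each side."""
    tokens = tokenize(text)
    keyword = keyword.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines


def outside_defining_vocabulary(definition, defining_vocabulary):
    """Return the words in a definition that are not in the defining vocabulary."""
    allowed = {word.lower() for word in defining_vocabulary}
    return sorted(set(tokenize(definition)) - allowed)


text = "A corpus is a collection of texts. The corpus must be representative."
for line in concordance(text, "corpus", width=2):
    print(line)

# Hypothetical defining vocabulary, far smaller than any real one.
print(outside_defining_vocabulary("a large body of texts", ["a", "large", "of", "texts"]))
```

Reading down the bracketed column of a concordance makes it easier to spot the typical neighbors of an entry, which is the raw material for usage notes.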

19
Tasks in Language Teaching
  • Evaluate student progress
      • app: compare written vocabulary at beginning and
        end of semester (F)
      • app: compare part of speech usage at beginning
        and end of semester (G)
      • app: track specific errors over the semester (H)
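
The vocabulary comparison in (F) can be sketched as a comparison of two sets of word types. This is a minimal Python illustration (the course itself uses Unix tools and Perl); the function name and the returned fields are assumptions, and the two sample sentences are invented.

```python
import re


def tokenize(text):
    """Lowercase the text and return its word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def vocabulary_change(early_text, late_text):
    """Compare the word types a student used early and late in the semester."""
    early = set(tokenize(early_text))
    late = set(tokenize(late_text))
    return {
        "early_types": len(early),
        "late_types": len(late),
        "new_words": sorted(late - early),      # used late but not early
        "dropped_words": sorted(early - late),  # used early but not late
    }


# Invented sample writing, for illustration only.
change = vocabulary_change("I like the book", "I enjoy the long book")
print(change)
```

Raw type counts grow with text length, so the two writing samples should be of similar size before the counts are compared.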

20
Tasks in Translation
  • Disambiguate word senses (I)
  • Translate sentences (à la Babel Fish!) (J)
  • These are fairly mangled translations, but
    they give you an understanding of why machine
    translation is so hard and what might be done
    to improve it.

21
Teaching/Publishing
  • Determine the ratio of Latinate to Anglo-Saxon
    morphemes in English. (K)
  • Determine the usage of Latinate forms in
    children's literature in English. (L)
  • This may have theoretical implications.