Foundations of Statistical NLP Chapter 4. Corpus-Based Work

Provided by: klplReP

1
Foundations of Statistical NLP
Chapter 4. Corpus-Based Work

2
Abstract
  • Getting Set Up
  • Computers, Corpora, Software
  • Looking at Text
  • Low-level formatting issues
  • Tokenization: What is a word?
  • Morphology
  • Sentences
  • Marked-up Data
  • Markup schemes
  • Grammatical tagging

3
Getting Set Up (1/2)
  • Computers
  • Text corpora are usually big.
  • Processing that much text takes substantial computing resources,
    once the major limitation on the use of corpora.
  • Corpora
  • Use text corpora distributed by the main organizations.
  • Corpus: a special collection of textual material
  • The general issue is whether the corpus is a representative
    sample of the population of interest.

4
Getting Set Up (2/2)
  • Software
  • Text editors: show the text fairly literally
  • Regular expressions: find certain patterns
  • Programming languages: C, C++, Perl
  • Programming techniques
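
As an illustration of using regular expressions to find a certain pattern in text, here is a minimal Python sketch. The sample sentence and the name AGE_PATTERN are made up for this example and do not come from the slides.

```python
import re

# Hypothetical sample text; any raw corpus line would do.
text = "The 26-year-old sent an e-mail. A 3-year-old replied."

# Match age expressions of the form "<digits>-year-old".
AGE_PATTERN = re.compile(r"\b\d+-year-old\b")

matches = AGE_PATTERN.findall(text)
print(matches)
```

The same idea scales to a whole corpus by applying the compiled pattern line by line.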

5
(No Transcript)
6
Looking at Text
  • Text comes in raw format or marked up.
  • Markup
  • a term used for putting codes of some sort into
    a computer file
  • Commercial word processing: WYSIWYG
  • Features of text in human languages
  • make it difficult to process automatically

7
Low-level formatting issues
  • Junk formatting/content
  • Junk: document headers, separators, tables,
    diagrams, etc.
  • OCR output: if the program deals only with English text -> remove
    junk (other content)
  • Uppercase and lowercase
  • The original Brown corpus was all capitals; a marker was used to
    indicate a capital letter.
  • Should we treat brown in Richard Brown and brown
    paint as the same?
  • Proper name detection: a difficult problem
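
One simple heuristic for the case question above is to lowercase only the sentence-initial word, where English capitalizes every word anyway. The sketch below assumes the input is already split into one sentence of tokens; the function name fold_sentence_initial is hypothetical.

```python
def fold_sentence_initial(tokens):
    """Lowercase only the first token of a sentence; leave the rest alone."""
    if not tokens:
        return []
    return [tokens[0].lower()] + list(tokens[1:])

# "Brown" at the start is folded to "brown"; the surname later on is kept.
folded = fold_sentence_initial(
    ["Brown", "paint", "was", "bought", "by", "Richard", "Brown"]
)
print(folded)

# The heuristic is wrong when the sentence starts with a proper name.
wrong = fold_sentence_initial(["Richard", "Brown", "left"])
print(wrong)
```

The second call shows why proper name detection remains a difficult problem: the heuristic cannot tell a name from an ordinary word in sentence-initial position.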

8
Tokenization: What is a word? (1)
  • Tokenization
  • dividing the input text into units called tokens
  • What is a word?
  • Graphic word (Kucera and Francis, 1967)
  • a string of contiguous alphanumeric characters
    with space on either side; may include hyphens and
    apostrophes, but no other punctuation marks
  • -> a workable definition, but it does not cover
    $22.50, Micro$oft, C|net
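
The graphic word definition can be sketched as a regular expression applied to whitespace-delimited chunks. This is a minimal, literal reading of the definition, assuming ASCII letters and digits only; the names GRAPHIC_WORD and split_graphic_words are made up for the example.

```python
import re

# Alphanumeric runs, optionally joined by single internal hyphens or apostrophes.
GRAPHIC_WORD = re.compile(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*")

def split_graphic_words(text):
    """Split on whitespace and sort chunks into graphic words and the rest."""
    accepted, rejected = [], []
    for chunk in text.split():
        if GRAPHIC_WORD.fullmatch(chunk):
            accepted.append(chunk)
        else:
            rejected.append(chunk)
    return accepted, rejected

accepted, rejected = split_graphic_words(
    "Micro$oft sells e-mail software for $22.50 and it isn't cheap"
)
print(accepted)
print(rejected)
```

Chunks such as Micro$oft and $22.50 are rejected because they contain punctuation other than hyphens and apostrophes, which is exactly where the definition stops being workable.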

9
Tokenization: What is a word? (2)
  • Period
  • distinguish end-of-sentence punctuation marks from
    abbreviation marks, as in etc. or Wash.
  • Single apostrophes
  • English contractions: I'll or isn't
  • dog's: dog is, dog has, or the genitive case
  • Hyphenation
  • line-breaking hyphens are present in typographical
    sources
  • e-mail, 26-year-old, co-operate
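
A rough sketch of undoing line-breaking hyphens is shown below. It assumes every hyphen at the end of a line is a line-breaking one, which is not always true; the function name join_line_breaks is hypothetical.

```python
import re

# A word fragment, a hyphen at end of line, optional indentation, then the rest.
LINE_BREAK_HYPHEN = re.compile(r"(\w+)-\n\s*(\w+)")

def join_line_breaks(text):
    """Rejoin words that were split by a hyphen at a line break."""
    return LINE_BREAK_HYPHEN.sub(r"\1\2", text)

joined = join_line_breaks("information of a certain seman-\n    tic type")
print(joined)

# Caveat: a lexical hyphen that happens to fall at a line end is lost too.
lost = join_line_breaks("they co-\noperate")
print(lost)
```

The second call shows the ambiguity: when co-operate breaks at its own hyphen, the sketch cannot know the hyphen belonged to the word.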

10
Tokenization: What is a word? (3)
  • The same form representing multiple words
  • Homographs: saw has two lexemes (chap. 7)
  • Word segmentation in other languages
  • Many languages do not put spaces between
    words
  • Whitespace not indicating a word break
  • the New York-New Haven railroad
  • Variant coding of information of a certain
    semantic type

11
Morphology
  • Stemming
  • a process that strips off affixes and leaves you
    with a stem
  • Lemmatization
  • attempting to find the lemma or lexeme of
    which one is looking at an inflected form
  • The IR community has shown that stemming does
    not help performance
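
To make the idea of stripping affixes concrete, here is a deliberately crude suffix stripper. It is a toy under stated assumptions (a fixed list of four English suffixes and a minimum stem length), not the Porter stemmer or any stemmer named in the slides; SUFFIXES and crude_stem are hypothetical names.

```python
# Illustrative suffix list, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(word, min_stem=3):
    """Strip the first matching suffix if at least min_stem characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

stems = [crude_stem(w) for w in ["walking", "walked", "dogs", "boxes", "sing"]]
print(stems)
```

Lemmatization would instead need a lexicon or morphological analysis, since a stem such as "operat" (from "operating") is not a lemma.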

12
Sentences
  • What is a sentence?
  • something ending with a '.', '?' or '!'
  • text set off by a colon, semicolon, or dash may also be
    regarded as a sentence
  • Recent research on sentence boundary detection
  • Riley (1989): statistical classification trees
  • Palmer and Hearst (1994; 1997): a neural network
    to predict sentence boundaries
  • Mikheev (1998): Maximum Entropy approaches to the
    problem
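
Before any of the statistical approaches listed above, a simple baseline is a hand-written heuristic. The sketch below is such a baseline, not a reimplementation of any cited method: it assumes a small, made-up abbreviation list and treats '.', '?' or '!' as a boundary only when the token is not a known abbreviation and the next token starts with a capital letter.

```python
# Illustrative abbreviation list; a real system would need a much larger one.
ABBREVIATIONS = {"etc.", "Wash.", "Dr.", "Mr.", "Prof."}

def split_sentences(text):
    """Split whitespace-tokenized text into sentences with a simple heuristic."""
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        ends_with_mark = token.endswith((".", "?", "!"))
        is_abbreviation = token in ABBREVIATIONS
        next_token = tokens[i + 1] if i + 1 < len(tokens) else None
        next_is_capital = next_token is None or next_token[0].isupper()
        if ends_with_mark and not is_abbreviation and next_is_capital:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without a closing mark
        sentences.append(" ".join(current))
    return sentences

result = split_sentences("He moved to Wash. in May. Did Dr. Smith agree? Yes!")
print(result)
```

The heuristic fails whenever an abbreviation really does end a sentence, which is part of the motivation for learned classifiers.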

13
Mark-up Schemes
  • Early days: markup schemes
  • including header information in texts (giving
    author, date, title, etc.)
  • SGML
  • a general language that lets one define a grammar
    for texts
  • XML
  • a simplified subset of SGML particularly designed for web
    applications
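
As a small illustration of marked-up text with header information, the sketch below parses a made-up XML document using Python's standard xml.etree.ElementTree module. The element names (text, header, body, s) and the sample values are assumptions for this example, not a scheme taken from the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical marked-up text: a header plus sentence-level markup.
SAMPLE = """<text>
  <header>
    <author>A. Writer</author>
    <date>1999</date>
    <title>A sample text</title>
  </header>
  <body>
    <s>Text corpora are usually big.</s>
    <s>Is this a sentence?</s>
  </body>
</text>"""

root = ET.fromstring(SAMPLE)

# Collect the header fields into a dictionary keyed by element name.
header = {child.tag: child.text for child in root.find("header")}

# Collect the text of every sentence element.
sentences = [s.text for s in root.iter("s")]

print(header)
print(sentences)
```

With explicit sentence markup like this, the sentence boundary problem is already solved by whoever annotated the text.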

14
Grammatical tagging
  • First step of analysis
  • automatic grammatical tagging for categories
  • finer distinctions, e.g. distinguishing comparative and
    superlative
  • Tag sets (Table 4.5)
  • incorporate morphological distinctions of a
    particular language
  • The design of a tag set
  • Target feature of classification
  • useful information about the grammatical class of
    a word
  • Predictive feature
  • predicting the behavior of other words in the
    context

15
(No Transcript)