Corpus-Based Work - PowerPoint PPT Presentation

Provided by: ahme84

Transcript and Presenter's Notes
1
Corpus-Based Work
  • Chapter 4
  • Foundations of statistical natural language
    processing

2
Introduction
  • Requirements of NLP work
  • Computers
  • Corpora
  • Application/Software
  • This section covers some issues concerning the
    formats and problems encountered in dealing with
    raw data
  • Low-level processing before actual work
  • Word/Sentence extraction

3
Getting Set Up
  • Computers
  • Memory requirements for large corpora
  • Statistical NLP methods involve counts that
    must be accessed quickly
  • Corpora
  • A corpus is a special collection of textual
    material collected according to a certain set of
    criteria
  • Licensing
  • Most of the time free sources are not
    linguistically marked-up

4
  • Corpora
  • Representative sample
  • What we find for the sample should also hold
    for the general population
  • Balanced corpus
  • Each subtype of text included according to a
    predetermined criterion of importance
  • Importance in statistical NLP of a
    representative corpus
  • Reported results should state the type/domain
    of the corpus used

5
  • Software
  • Text editors
  • TextPad, Emacs, BBEdit
  • Regular expressions
  • Patterns as regular language
  • Programming language
  • C/C++ widely used (efficient)
  • Perl for text preparation and formatting
  • A built-in database and easy handling of
    complicated structures make Prolog important
  • Java, as a pure object-oriented language, gives
    automatic memory management

6
Looking at Text
  • Either in raw format or marked-up
  • Markup means inserting codes into the data
    file that convey information about the text
  • Issues in automatic processing
  • Junk formatting/content (Corpus Cleaning)
  • Case sensitivity (normalize all capitalization?)
  • Proper Nouns?
  • Stress through capitalization
  • Loss of contextual information
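The case-folding trade-off on this slide can be shown with a short sketch (the example tokens are illustrative, not from any particular corpus):

```python
from collections import Counter

# Lowercasing merges "The"/"the" (usually wanted) but also merges the
# proper noun "Brown" with the adjective "brown" -- contextual loss.
def fold_case(tokens):
    return [t.lower() for t in tokens]

tokens = ["The", "Brown", "Corpus", "is", "printed", "on", "brown", "paper"]
print(Counter(fold_case(tokens))["brown"])  # 2 -- the two senses merge
```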

7
  • Tokenization
  • Text is divided into units called tokens
  • Treatment of punctuation marks?
  • What is a word?
  • Graphic word (Kucera and Francis 1967)
  • A string of contiguous alphanumeric characters
    with white space on either side.
  • This is not a practical definition, even for
    text in the Latin alphabet
  • Especially in news corpora, odd entries can
    appear, e.g. Micro$oft, C|net
  • Apart from these oddities there are some other
    issues
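The "graphic word" definition amounts to a one-line regular expression; the sketch below also shows why it breaks on names like C|net and Micro$oft:

```python
import re

# Graphic word (Kucera and Francis): a maximal run of contiguous
# alphanumeric characters, delimited by white space or punctuation.
def graphic_words(text):
    return re.findall(r"[A-Za-z0-9]+", text)

print(graphic_words("C|net said Micro$oft rose."))
# ['C', 'net', 'said', 'Micro', 'oft', 'rose'] -- both names are split
```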

8
  • Periods
  • Words are not always surrounded by white space;
    punctuation such as commas, semicolons and
    periods often attaches to them
  • Periods are at the end of sentence and also at
    the end of abbreviations
  • In abbreviations the period should stay
    attached to the word (Wash. the abbreviation
    vs. wash the word)
  • When abbreviations occur at the end of sentence
    there is only one period present, performing both
    functions
  • Within morphology, this phenomenon is referred
    to as haplology

9
  • Single Apostrophes
  • Difficulties in dealing with contractions such
    as I'll or isn't
  • By the basic definition each counts as 1
    graphic word, but it should be counted as 2
  • 1. Grammar rules such as S → NP VP expect the
    contraction to be split (I + 'll)
  • 2. But if we split, some funny "words" appear
    in the collection ('ll, n't)
  • Single quotes also mark the end of quotations
  • Possessive forms of words ending in s or z
  • Charles' law, Muaz' book
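A hedged sketch of splitting such contractions into two tokens, in the style of Penn Treebank tokenization (the suffix list here is illustrative, not the full Treebank rule set):

```python
import re

# Split a clitic contraction off the end of a token, if present.
def split_contraction(token):
    m = re.match(r"^(.+)(n't)$", token)
    if m:
        return [m.group(1), m.group(2)]  # isn't -> is + n't
    m = re.match(r"^(.+)('ll|'re|'ve|'d|'s|'m)$", token)
    if m:
        return [m.group(1), m.group(2)]  # I'll -> I + 'll
    return [token]

print(split_contraction("I'll"))   # ['I', "'ll"]
print(split_contraction("isn't"))  # ['is', "n't"]
```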

10
  • Hyphenation
  • Does a sequence of letters with a hyphen in
    between count as one word or two?
  • Line ending hyphens
  • Remove hyphen at the end of line and join both
    parts together
  • But if the hyphen at the end of a line is a
    genuine word-internal hyphen (as in text-based),
    joining the parts without it is wrong
  • Mostly in electronic text line breaking hyphens
    are not present, but there are some other
    issues.
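One common way to rejoin line-breaking hyphens is a lexicon check: keep the hyphen only if the hyphenated form is itself attested. A minimal sketch, where the lexicon is a toy list purely for illustration:

```python
# Toy lexicon of attested hyphenated words (an assumption for the demo).
HYPHENATED_LEXICON = {"text-based", "so-called", "e-mail"}

def rejoin(line_end, next_start):
    base = line_end.rstrip("-")
    hyphenated = base + "-" + next_start
    if hyphenated in HYPHENATED_LEXICON:
        return hyphenated          # the hyphen was part of the word
    return base + next_start       # it was only a line break

print(rejoin("text-", "based"))  # text-based
print(rejoin("co-", "operate"))  # cooperate
```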

11
  • Some things with hyphens are clearly treated as
    one word
  • E-mail, A-1-plus and co-operate
  • Other cases are arguable
  • Non-lawyer, pro-Arabs and so-called
  • The hyphens here are called lexical hyphens
  • Inserted before or after small word formatives,
    sometimes to split up vowel sequences
  • Third class of hyphens is inserted to indicate
    correct grouping
  • A text-based medium
  • A final take-it-or-leave-it offer

12
  • Inconsistencies in hyphenation
  • Cooperate vs. co-operate
  • So we can have multiple forms treated as either
    one word or two
  • Lexemes
  • Single dictionary entry with single meaning
  • Homographs
  • Two distinct lexemes share the same written form
  • e.g. saw (past tense of see vs. the cutting
    tool)

13
  • Word segmentation in other languages
  • Opposite issue
  • White space that is not a word boundary
  • the New York-New Haven railroad
  • I couldn't work the answer out
  • In spite of, in order to, because of
  • Variant codings of information of a certain
    semantic type
  • Phone numbers 42-111-128-128
  • A problem for information extraction
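One way to handle such variant codings is a single pattern that abstracts over the separators; a sketch for the phone-number example (the digit grouping and separator set are assumptions for the demo):

```python
import re

# Match digit groups joined by one consistent separator: '-', '.' or
# space. \1 forces every separator in a number to be the same.
PHONE = re.compile(r"\b\d{2,4}([-. ])\d{2,4}(?:\1\d{2,4}){1,3}\b")

m = PHONE.search("call 42-111-128-128 today")
print(m.group(0))  # 42-111-128-128
```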

14
  • Speech Corpora Issues
  • More contractions
  • Various phonetic representations
  • Pronunciation variants
  • Sentence fragments
  • Filler words
  • Morphology
  • Keep various forms separately or collapse them?
    e.g. sit, sits, sat
  • Grouping them together and working with lexemes
    (Initially looks easier)
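Grouping forms under a lexeme amounts to a lookup from inflected form to lemma; a toy sketch (the lemma table is illustrative, not a real morphological lexicon):

```python
from collections import Counter

# Toy lemma table: each inflected form maps to its lexeme.
LEMMAS = {"sits": "sit", "sat": "sit", "sitting": "sit"}

def lemma(word):
    return LEMMAS.get(word, word)  # unknown forms map to themselves

counts = Counter(lemma(w) for w in ["sit", "sits", "sat", "stood"])
print(counts["sit"])  # 3 -- three forms collapse onto one lexeme
```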

15
  • Stemming
  • Strips off affixes
  • Lemmatization
  • To extract the lemma or lexeme from inflected
    form
  • Empirical research within IR shows that
    stemming does not help performance
  • Information loss (operating → operate)
  • Closely related tokens are grouped in chunks,
    which are more useful
  • Not good for morphologically rich languages
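A minimal suffix-stripping stemmer (a sketch, not the Porter algorithm; the suffix list is an assumption) illustrates both the grouping effect and the information loss noted above:

```python
# Strip the first matching suffix, keeping a stem of at least 3 letters.
SUFFIXES = ["ational", "ing", "ed", "es", "s"]

def stem(word):
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

print([stem(w) for w in ["operating", "operates", "operated"]])
# ['operat', 'operat', 'operat'] -- grouped, but no longer a word
```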

16
  • Sentences
  • What is a sentence?
  • In English, something ending with ., ? or !
  • Abbreviations issues
  • Other issues
  • "You reminded me," she remarked, "of your
    mother."
  • Nested things are classified as clauses
  • Quotation marks after punctuation
  • . is not sentence boundary in this case

17
  • Sentence boundary (SB) detection
  • Place tentative SB after all occurrences of .?!
  • Move the boundary after quotation mark (if any)
  • Disqualify a period boundary if it is
  • Preceded by a known abbreviation that does not
    normally occur sentence-finally but is commonly
    followed by a capitalized name, e.g. Prof., Dr.
  • Or preceded by a known abbreviation not
    followed by a capitalized word, as with etc., Jr.
  • Disqualify a boundary with ? or !
  • If followed by a lower-case letter
  • Regard all others as correct SBs
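The heuristic above can be sketched in a few lines. The abbreviation list is illustrative; a real system would use a much larger one:

```python
import re

# Titles like Prof., Dr. precede names, so their period never ends a
# sentence (an illustrative list, not exhaustive).
TITLE_ABBREVS = {"Prof.", "Dr.", "Mr.", "Mrs."}

def is_boundary(token, next_token):
    """Decide whether `token` ends a sentence, per the rules above."""
    # 1. Tentative boundary after . ? ! (possibly before a quote mark).
    if not re.search(r"[.?!][\"']?$", token):
        return False
    # 2. Disqualify title abbreviations outright.
    if token in TITLE_ABBREVS:
        return False
    # 3. Disqualify if the next word starts lower-case (etc., mid-
    #    sentence ? or !).
    if next_token is not None and next_token[:1].islower():
        return False
    return True

toks = ["Dr.", "Smith", "stayed", "home", "."]  # one sentence
print([is_boundary(t, toks[i + 1] if i + 1 < len(toks) else None)
       for i, t in enumerate(toks)])
# [False, False, False, False, True]
```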

18
  • Riley (1989) used classification trees for SB
    detection
  • Features of trees included case and length of
    words preceding or following a period and
    probabilities of words to occur before and after
    a sentence boundary
  • It required a large quantity of labeled data
  • Palmer and Hearst used the POS of such words
    and implemented the approach with neural
    networks (98-99% accurate)
  • In other languages?

19
  • Marked-up Data
  • Some sort of code is used to provide information
    (mostly SGML, XML)
  • It can be done automatically, manually or mixture
    of both (Semi-Automatic)
  • Some texts mark up just sentence and paragraph
    boundaries
  • Other mark up more than this basic information
  • e.g. Penn Treebank (full syntactic structure)
  • Common mark up is POS tagging
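The kind of markup this slide describes might look like the sketch below: an XML encoding of a POS-tagged sentence. The element and attribute names here are illustrative, not from any particular standard:

```python
import xml.etree.ElementTree as ET

# Build <s> (sentence) containing one <w> (word) element per token,
# each carrying its POS tag as an attribute.
s = ET.Element("s")
for word, pos in [("Corpora", "NNS"), ("are", "VBP"), ("useful", "JJ")]:
    w = ET.SubElement(s, "w", pos=pos)
    w.text = word

print(ET.tostring(s, encoding="unicode"))
# <s><w pos="NNS">Corpora</w><w pos="VBP">are</w><w pos="JJ">useful</w></s>
```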

20
  • Grammatical Tagging
  • Generally done with conventional POS tags such
    as noun, verb, etc.
  • Also some information regarding nature of the
    words like Plurality of nouns or Superlative
    forms of adjectives
  • Tag set
  • The most influential tag sets have been those
    used to tag the American Brown Corpus and the
    Lancaster-Oslo-Bergen (LOB) corpus

21
  • Size of tag sets
  • Brown: 87 simple tags (179 total)
  • Penn: 45
  • CLAWS1: 132
  • Penn tag set is widely used in computational work
  • Tags are different in different tag sets
  • Larger tag sets make finer-grained distinctions
  • The level of detail depends on the domain of
    the corpus

22
  • The design of tag set
  • Grammatical class of word
  • Features to tell the behavior of the word
  • Part of Speech
  • Semantic grounds
  • Syntactic distributional grounds
  • Morphological grounds
  • Splitting tags into finer categories gives more
    information but makes classification harder
  • There is not a simple relationship between tag
    set size and performance of taggers