CLINT - PowerPoint PPT Presentation

About This Presentation
Title:

CLINT


Slides: 22
Provided by: MikeR2

Transcript and Presenter's Notes

Title: CLINT


1
CLINT
  • Tokenisation

2
Information Food Chain
  • Inference
  • Knowledge Representation
  • Meaning Extraction
  • Semantic Relationships
  • Chunking (noun phrases, verb phrases)
  • Part of Speech Annotation
  • Paragraph and sentence identification
  • Tokenisation
  • Raw Text

3
Start with a Corpus
  • A corpus is an organised body of materials from
    language that is used as a basis for empirical
    studies.
  • Corpora are classified according to:
  • Representativeness
  • Medium
  • Language
  • Information Content
  • Structure

4
Examples of Corpora
  • Project Gutenberg: public domain text resources.
    http://www.promo.net/pg
  • Brown Corpus: a tagged corpus of about 1M words
    put together at Brown, 1960-70
  • Penn Treebank: a corpus of parsed sentences based
    on text from the WSJ
  • Canadian Hansards: bilingual (En, Fr) corpus of the
    Canadian parliament.

5
Low Level Issues
  • Preprocessing: getting rid of junk such as
    whitespace, images, certain formatting
    information, etc.
  • Normalisation: deciding on standard character
    representations; adopting upper or lower case (or
    both)
  • Tokenisation

6
Tokenisation
  • Tokenisation is a process which divides input
    text into individual units called tokens.
  • Tokens are normally taken to be indivisible by
    the next level of analysis, but they can be
    associated with various kinds of information.
  • An example of such information is the type of the
    token: word, punctuation, number
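
A minimal sketch of a tokeniser that attaches a type to each token, as described above. The `Token` class, the three type names and the pattern are illustrative assumptions, not part of the original slides; characters that match none of the three classes (whitespace, underscores) are simply skipped.

```python
import re
from dataclasses import dataclass


@dataclass
class Token:
    text: str   # the token as it appears in the input
    kind: str   # "word", "number" or "punctuation" (hypothetical labels)
    start: int  # character offset of the token in the input


# One named group per token type; the name of the group that matched
# becomes the token's type.
PATTERN = re.compile(
    r"(?P<number>\d+)|(?P<word>[^\W\d_]+)|(?P<punctuation>[^\w\s])"
)


def tokenise(text):
    """Divide text into tokens, each tagged with its type and offset."""
    return [Token(m.group(), m.lastgroup, m.start())
            for m in PATTERN.finditer(text)]


for token in tokenise("Cats, 2!"):
    print(token)
```

The next level of analysis can then treat each `Token` as indivisible while still consulting its `kind` and `start` fields.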

7
What counts as a word?
  • Words are quite tricky to define
  • The standard definition: a string of contiguous
    alphanumeric characters with space on either
    side; may include hyphens and apostrophes but no
    other punctuation marks (Kucera and Francis 1967)
  • It is easy to find exceptions.
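
The standard definition can be written down almost literally, which makes the exceptions easy to see. This is a sketch under the assumption that "alphanumeric" means ASCII letters and digits; the function name `classic_words` is hypothetical.

```python
import re

# Alphanumeric characters, plus hyphens and apostrophes, and nothing else.
WORD = re.compile(r"[A-Za-z0-9'-]+")


def classic_words(text):
    """Split on whitespace and sort chunks by whether they fit the definition."""
    accepted, rejected = [], []
    for chunk in text.split():
        if WORD.fullmatch(chunk):
            accepted.append(chunk)
        else:
            rejected.append(chunk)
    return accepted, rejected


accepted, rejected = classic_words("a deserved 2-1 victory on Wednesday.")
print(accepted)  # chunks that count as words under the definition
print(rejected)  # chunks that do not, here because of the final period
```

Under this reading "2-1" counts as a word while "Wednesday." does not, which is one of the exceptions the definition runs into.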

8
Problems Identifying Words
  • VfB Stuttgart scored twice in quick success-ion
    early in the second half on their way to a
    deserved 2-1 victory over Manchester United in
    the Champions League on Wednesday. (Example from
    Mary Dalrymple, University of London)
  • VfB Stuttgart, Manchester United
  • succession
  • 2-1
  • Wednesday

9
Problems Identifying Words: Problems Involving Spaces
  • Lack of spaces between words:
    Lebensversicherungsgesellschaftsangestellter (life
    insurance company employee), Ix-Xemx
  • The presence of spaces may not indicate a word
    break: Coca Cola, 356 21 456 457

10
Problems Involving Special Characters
  • Words often include non-alphanumeric characters
    which are actually part of the word: 22.50,
    www.di-ve.com.mt, BSc. IT, :-)
  • Words are often terminated by punctuation which
    is not part of the word.
  • Sometimes, terminating punctuation is part of the
    word.
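
One common way to keep such characters inside a token is to try specific patterns before general ones. The sketch below assumes a small, hypothetical rule list (web addresses starting with "www.", decimal numbers, words with internal hyphens or apostrophes, single punctuation marks); a real tokeniser would need many more rules.

```python
import re

# Ordered rules: earlier patterns win, so specific cases come first.
RULES = [
    ("url", r"www\.[\w.-]+\w"),                # www.di-ve.com.mt
    ("number", r"\d+(?:[.,]\d+)*"),            # 22.50
    ("word", r"[^\W\d_]+(?:[-'][^\W\d_]+)*"),  # hi-fi, it's
    ("punct", r"[^\w\s]"),                     # anything else, one mark at a time
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in RULES))


def tokenise(text):
    """Return (rule name, token) pairs using the first rule that matches."""
    return [(m.lastgroup, m.group()) for m in MASTER.finditer(text)]


for kind, token in tokenise("Visit www.di-ve.com.mt, it's hi-fi for 22.50."):
    print(kind, token)
```

The comma after the web address and the final period come out as separate punctuation tokens, while the periods inside the address and the number stay part of their tokens.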

11
Periods
  • In general, punctuation marks attach to words
    and can be removed. However, there are special
    cases:
  • Most periods mark end of sentence
  • Others mark abbreviations, e.g. "e.g.", "Wash."
  • Note that when an abbreviation occurs at the end
    of a sentence there is only one period.
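
A simple way to handle the special cases is to consult a list of known abbreviations before detaching a final period. The list below is a tiny illustrative assumption, and the function name `split_final_period` is hypothetical.

```python
# Illustrative only: a real list would be much longer and language-specific.
ABBREVIATIONS = {"e.g.", "i.e.", "Wash.", "BSc."}


def split_final_period(chunk):
    """Detach a trailing period unless the chunk is a known abbreviation."""
    if len(chunk) > 1 and chunk.endswith(".") and chunk not in ABBREVIATIONS:
        return [chunk[:-1], "."]
    return [chunk]


print(split_final_period("half."))
print(split_final_period("Wash."))
```

This sketch does not address the last point above: when an abbreviation ends a sentence, its single period does double duty, and a lookup alone cannot tell that the sentence has also ended.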

12
Apostrophe
  • English contractions such as won't or I'll count
    as one word according to the classic definition
  • However, there are reasons for wanting two
    separate tokens, such as interaction with
    grammar rules (S → NP VP)
  • Penn Treebank splits such contractions into two
    words.

13
Apostrophe
  • This sometimes leaves odd words. For example, isn't
    yields is n't
  • 's is ambiguous
  • Abbreviation for is (he's strange)
  • Possessive (John's car)
  • Word-final apostrophe is ambiguous
  • end of quotation
  • possessive of word ending in s
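
The contraction splitting discussed on the two Apostrophe slides can be sketched with a single pattern. This is an approximation in the spirit of the Penn Treebank convention, not its actual implementation; the clitic list is an assumption and only the straight apostrophe (') is handled.

```python
import re

# A shortest possible stem followed by one known clitic at the end of the word.
CLITIC = re.compile(r"(?i)^(.+?)(n't|'s|'ll|'re|'ve|'d|'m)$")


def split_contraction(word):
    """Split a contraction into stem and clitic; leave other words alone."""
    match = CLITIC.match(word)
    return [match.group(1), match.group(2)] if match else [word]


for word in ["isn't", "won't", "he's", "John's", "cat"]:
    print(word, split_contraction(word))
```

The split is purely formal: "he's" and "John's" both yield a separate 's token, so deciding between the abbreviation of "is" and the possessive is left to a later stage.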

14
Exercise
  • How is the apostrophe used in Maltese?
  • How should a Maltese tokeniser deal with it?

15
Hyphen
  • Issue: do sequences of words joined by hyphens
    count as one word or more?
  • Typesetting hyphens (at end of line) and hyphens
    in measure phrases (35-year-old) are usually
    removed.
  • Typesetting hyphens can be ambiguous
  • Lexical hyphens are usually kept: hi-fi
  • Hyphens standing alone are used as
    punctuation.
  • Texts are often inconsistent in usage of hyphens
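
Removing typesetting hyphens can be sketched as a single substitution that joins a word broken across a line. The function name is hypothetical, and the rule is deliberately naive.

```python
import re


def join_line_break_hyphens(text):
    """Remove a hyphen that is followed directly by a line break."""
    return re.sub(r"(\w)-\n\s*(\w)", r"\1\2", text)


print(join_line_break_hyphens("in quick success-\n    ion early"))
print(join_line_break_hyphens("a hi-fi system"))
print(join_line_break_hyphens("a hi-\nfi system"))
```

The last call illustrates the ambiguity noted above: when a lexical hyphen happens to fall at the end of a line, this rule wrongly produces "hifi".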

16
Case
  • Types vs. Tokens
  • How many tokens in the following sentence? The
    cat chased the rat on the table
  • How many types?
  • Tokenisation should correctly identify word
    types, i.e.
  • Tokens of the same type should be identified
  • Tokens of different type should be distinguished
  • Case representation of ordinary words must be
    standardised.
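
The effect of case on the type count can be checked directly. This sketch assumes whitespace tokenisation, which is enough for the example sentence; the function name and the `fold_case` parameter are hypothetical.

```python
def count_tokens_and_types(sentence, fold_case=False):
    """Return (number of tokens, number of distinct types)."""
    tokens = sentence.split()
    if fold_case:
        # Standardise case so that "The" and "the" count as one type.
        tokens = [token.lower() for token in tokens]
    return len(tokens), len(set(tokens))


sentence = "The cat chased the rat on the table"
print(count_tokens_and_types(sentence))                  # (8, 7)
print(count_tokens_and_types(sentence, fold_case=True))  # (8, 6)
```

Without case standardisation "The" and "the" are counted as different types, which is the mismatch the last bullet is meant to prevent.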

17
Case
  • Heuristics
  • Map first character of a sentence to standard
    case
  • Map all words in titles to lowercase
  • Problems
  • Identification of sentence boundaries
  • Identification of proper names
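
The first heuristic and its second problem can be sketched together. The sketch assumes the sentence has already been identified and tokenised, and it uses a small hypothetical list of proper names as a stand-in for real name identification.

```python
# Hypothetical stand-in for a proper-name recogniser.
KNOWN_NAMES = {"Stuttgart", "Manchester", "Wednesday"}


def normalise_sentence_case(tokens):
    """Lowercase the first token of a sentence unless it is a known name."""
    if not tokens:
        return []
    first = tokens[0]
    if first not in KNOWN_NAMES:
        first = first[0].lower() + first[1:]
    return [first] + tokens[1:]


print(normalise_sentence_case(["The", "cat", "sat"]))
print(normalise_sentence_case(["Manchester", "lost"]))
```

Any name missing from the list is lowercased along with ordinary words, so the quality of the result depends directly on how well proper names are identified.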

18
Normalisation
  • Character representations.
  • Converting all letters to lower or upper case
  • Removing punctuation
  • Removing letters with accent marks and other
    diacritics
  • Expanding abbreviations
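
Two of these steps, case conversion and removal of diacritics, can be sketched with Python's standard `unicodedata` module. The approach assumed here is to decompose each character and drop the combining marks; letters that have no such decomposition are left unchanged.

```python
import unicodedata


def strip_diacritics(text):
    """Decompose characters (NFD) and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalise(text):
    """Lowercase the text and remove diacritics."""
    return strip_diacritics(text.lower())


print(normalise("Schütze"))
print(normalise("Café"))
```

Whether this is desirable depends on the language: where an accent distinguishes two words, removing it merges types that should stay distinct.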

19
Further Normalisation
  • Stemming: are eats and eating different words?
  • They are two different wordforms
  • that have the same stem, eat, but different
    suffixes, -s and -ing
  • Stemming versus full morphological analysis.
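
A deliberately naive suffix stripper shows what stemming does with the example, and why it falls short of morphological analysis. The suffix list and the minimum stem length are assumptions made for this sketch; it is not a published stemming algorithm.

```python
SUFFIXES = ("ing", "s")  # only the two suffixes from the example


def naive_stem(word):
    """Strip the first matching suffix, keeping a stem of at least two letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[:-len(suffix)]
    return word


for word in ["eats", "eating", "eat", "running"]:
    print(word, naive_stem(word))
```

"eats" and "eating" both reduce to "eat", but "running" becomes "runn": the stripper knows nothing about consonant doubling, which a full morphological analysis would handle.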

20
Summary
  • The tokenisation problem interacts with design
    decisions at different levels concerning
  • Handling of non-alphanumeric characters
  • Case
  • Punctuation
  • Typically, many of these problems are dealt with
    by hand-crafting special rules which match a
    particular case.
  • Such rules are often built out of regular
    expressions.

21
Sources
  • Foundations of Statistical Natural Language
    Processing, Manning and Schütze, MIT Press, 1999