CSA2050 Introduction to Computational Linguistics

Mar 2005 -- MR

Slides: 20
Provided by: michael307

1
CSA2050 Introduction to Computational Linguistics
  • Lecture 3
  • Examples

2
Course Contents
1 (MR) Overview
2 (RF) Chomsky Hierarchy
3 (MR) Examples
4 (RF) Grammatical Categories
5, 6 (MR) Tagging
7 (RF) Morphology
8, 9, 10 (MR) Comp Morphology
11 (RF) Syntax
12, 13, 14 (MR) Grammar Formalism
3
Outline
  • Examples in the areas of
  • Tokenisation
  • Morphological Analysis
  • Tagging
  • Syntactic Analysis

4
Information Extraction
raw text
tokenisation
tagged text
morphological analysis
syntactic analysis
named entity recognition
5
Tokenisation
  • The basic idea of tokenisation is to identify the
    basic tokens that are present in a text.
  • Mostly, tokens are the same as words, but not
    always
  • Why should this be a problem?
  • John's car cost 10,000.00.
  • 'And it's worth every penny', he exclaimed.
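
To make the problem concrete, the sketch below compares plain whitespace splitting with a small regular-expression tokeniser on the first example sentence (written with its apostrophe). The pattern `TOKEN_RE` and the function `tokenise` are hypothetical names chosen for this illustration, not part of the lecture, and the pattern covers only the cases shown here.

```python
import re

# Hypothetical pattern (an assumption of this sketch, not from the lecture):
#   1. digits with internal '.' or ',' separators, e.g. 10,000.00
#   2. word characters with optional internal apostrophes, e.g. John's
#   3. any other single non-space character (punctuation)
TOKEN_RE = re.compile(r"\d+(?:[.,]\d+)*|\w+(?:'\w+)*|[^\w\s]")

def tokenise(text):
    """Return the tokens matched by TOKEN_RE, from left to right."""
    return TOKEN_RE.findall(text)

sentence = "John's car cost 10,000.00."

# Whitespace splitting leaves the final period glued to the number.
naive = sentence.split()

# The regular expression separates the sentence-final period
# while keeping the number and the possessive form intact.
tokens = tokenise(sentence)
```

Whether `John's` should stay one token or be split into `John` and `'s` is a design decision that this sketch does not settle; it simply keeps the form together.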

6
Tokenisation Problems: Punctuation
  • novel forms: .net, Micro$oft, :-)
  • hyphenation
  • linebreaks vs word-internal: e-mail, 898-0587
  • multi-word: the 90-cent-an-hour raise
  • confusion with dash
  • apostrophes in contractions: we'll
  • periods
  • part of names: Amazon.com
  • numerical expressions: 1.99
  • abbreviations, end of sentence, haplology
  • commas: 1,000,000

7
Other Problems
  • Token-internal whitespace: 898 0464
  • Interaction: the New York-New Haven railroad
  • Mixed language tokens: u
  • Automated language guesser
  • Token equivalence (when are two tokens the same?)
  • Case-normalization.
  • Sentence boundary detection.
  • Inconsistency: database, data-base, data base
  • Demo: Xerox tokeniser
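
One possible way to approach token equivalence and inconsistent spelling is to map each token to a normalised key and treat tokens with the same key as equivalent. The sketch below is only an illustration under that assumption; the function name `normalise` is hypothetical.

```python
import re

def normalise(token):
    """Lower-case a token and delete internal hyphens and whitespace."""
    return re.sub(r"[-\s]+", "", token.lower())

# The three spellings from the list above, plus a capitalised variant.
variants = ["database", "data-base", "data base", "Database"]

# All variants collapse to a single key.
keys = {normalise(v) for v in variants}
```

Such a key is deliberately crude: it would also merge pairs that should stay distinct, such as "re-cover" and "recover", so a real system would need a more careful policy.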

8
Morphology
  • Simple versus complex words: dog, dogs
  • Complex words formed by concatenation of
    morphemes.
  • Morpheme: the smallest unit in a word that bears
    some meaning, such as dog and s.

9
Morphological Analysis
  • Morphological analysis of a word involves a
    segmentation problem
  • Segmentation: discovery of the component
    morphemes: dogs → dog + s, enlargement → en +
    large + ment
  • Possible ambiguities: enlargement → enlarge +
    ment, or en + largement
  • Role of lexicon
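
The role of the lexicon in segmentation can be sketched as follows: a word is split in every way that uses only known morphemes. The inventory `MORPHEMES` is a toy set invented for this example, and `segmentations` is a hypothetical helper, not a description of any particular analyser.

```python
# Toy morpheme inventory, invented for this example.
MORPHEMES = {"dog", "s", "en", "large", "enlarge", "ment"}

def segmentations(word, morphemes=MORPHEMES):
    """Return every split of `word` into a sequence of known morphemes."""
    if not word:
        return [[]]  # one way to segment the empty string: no morphemes
    results = []
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        if prefix in morphemes:
            # Extend each segmentation of the remainder with this prefix.
            for rest in segmentations(word[i:], morphemes):
                results.append([prefix] + rest)
    return results

dogs = segmentations("dogs")
enlargement = segmentations("enlargement")
```

With this inventory, "enlargement" receives two segmentations. Adding "largement" to the set would produce the third candidate, en + largement, which shows how the contents of the lexicon decide which ambiguities arise.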

10
Morphological Analysis
  • John has a couple of rabbits
  • rabbits → rabbit + s
  • s indicates plural of noun rabbit
  • Is this the only possibility?

11
Morphological Analysis
  • John rabbits on and on
  • rabbits → rabbit + s
  • s indicates 3rd person singular present of verb
    rabbit
  • The suffix s is a realisation of two entirely
    different morphemes.
  • The morpheme is something more abstract than the
    string which realises it.

12
Morphological Analysis
[Diagram: strings in the suffix world (-s, -a) linked to abstract morphemes in the morpheme world (3S, PL)]
13
Morphological Analysis
Input Word: rabbits
Morphological Parser
Output Analysis: rabbit N PL, rabbit V 3S
  • Output is a string of morphemes
  • Morpheme is employed in a loose sense that
    is useful for further processing
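
A minimal sketch of such a parser for the rabbits example is given below. It assumes a toy lexicon and a suffix table in which one suffix string can realise different abstract morphemes depending on the category of the stem. All names (`LEXICON`, `SUFFIXES`, `analyse`) are hypothetical, and bare stems or spelling changes are not handled.

```python
# Toy lexicon: stem -> possible categories (invented for this example).
LEXICON = {"rabbit": {"N", "V"}, "dog": {"N"}}

# One suffix string may realise several abstract morphemes,
# selected here by the category of the stem it attaches to.
SUFFIXES = {"s": {"N": "PL", "V": "3S"}}

def analyse(word):
    """Return the sorted (stem, category, morpheme) analyses of `word`."""
    analyses = []
    for suffix, features in SUFFIXES.items():
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            for category in LEXICON.get(stem, ()):
                if category in features:
                    analyses.append((stem, category, features[category]))
    return sorted(analyses)

rabbits = analyse("rabbits")
```

The result for "rabbits" contains both the noun plural and the verb third person singular reading, which is the ambiguity that a tagger later has to resolve in context.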

14
Morphological Analysis: ENGTWOL, Xerox
  • Atro Voutilainen, Juha Heikkilä, Timo Järvinen
    and Lingsoft, Inc. 1993-1995
  • ENGTWOL demo
  • Xerox morphological analysis

15
Morphological Synthesis
Input: rabbit N PL, rabbit V 3S
Morphological Parser
Output Word: rabbits
  • Input is a string of morphemes
  • Output is a word

16
Reversibility
  • Lookup
  • APPLY UP> left
  • left    leave+Verb+PastBoth+123SP
  • left    left+Adv
  • left    left+Adj
  • left    left+Noun+Sg
  • Lookdown
  • APPLY DOWN> leave+Adj
  • left
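
Reversibility can be illustrated by storing the surface/analysis relation once and querying it in either direction. The sketch below is an assumption-laden simplification: a real finite-state transducer encodes the relation as a network rather than a list, the function names `apply_up` and `apply_down` merely echo the commands in the example, and the `+` separators in the tags follow the usual Xerox notation.

```python
# Surface form / analysis pairs based on the lookup example above.
# Storing them as a plain list is an assumption of this sketch.
PAIRS = [
    ("left", "leave+Verb+PastBoth+123SP"),
    ("left", "left+Adv"),
    ("left", "left+Adj"),
    ("left", "left+Noun+Sg"),
]

def apply_up(surface):
    """Lookup: return all analyses of a surface form."""
    return [analysis for form, analysis in PAIRS if form == surface]

def apply_down(analysis):
    """Lookdown: return all surface forms that realise an analysis."""
    return [form for form, tags in PAIRS if tags == analysis]

up = apply_up("left")
down = apply_down("leave+Verb+PastBoth+123SP")
```

Because both functions read the same relation, analysis and synthesis cannot drift apart, which is the practical benefit of a reversible description.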

17
POS Tagging
  • In POS tagging, the task is to assign the most
    appropriate morphosyntactic label from amongst
    those listed in the lexicon, given the context.
  • John leaves presents.
  • Proper Names
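
The size of the tagging problem for the example sentence can be sketched by listing every tag sequence that the lexicon allows. The lexicon below and its tag names (NP, N, V) are invented for this illustration; choosing the most appropriate sequence from context is the actual tagging task and is not attempted here.

```python
from itertools import product

# Toy lexicon, invented for this example: word -> possible tags.
LEXICON = {
    "John": ["NP"],
    "leaves": ["N", "V"],
    "presents": ["N", "V"],
}

def candidate_taggings(words):
    """Return every tagging of `words` that the lexicon permits."""
    tag_choices = [LEXICON[w] for w in words]
    return [list(zip(words, tags)) for tags in product(*tag_choices)]

candidates = candidate_taggings(["John", "leaves", "presents"])
```

Even this three-word sentence has four candidate taggings, and the number grows multiplicatively with each ambiguous word, which is why taggers need contextual evidence to select one.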

18
Semantic Tagging
  • Named Entity Recognition
  • Basic idea is to recognise and tag named entities
    and classify them as being of type:
  • Persons
  • Locations
  • Organisations
  • Named Entity Recognition - Demo
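
One very simple approach to named entity recognition, shown here only as a sketch, is lookup in a gazetteer (a list of known names with their types). The gazetteer entries and the function `tag_entities` are assumptions made for this example; practical systems also use context and handle names that are not listed.

```python
# Toy gazetteer, invented for this example: name -> entity type.
GAZETTEER = {
    "John": "PERSON",
    "New York": "LOCATION",
    "New Haven": "LOCATION",
    "Xerox": "ORGANISATION",
}

def tag_entities(text):
    """Return (name, type) pairs for gazetteer names found in `text`,
    ordered by the position of their first occurrence."""
    found = []
    for name, entity_type in GAZETTEER.items():
        position = text.find(name)
        if position != -1:
            found.append((position, name, entity_type))
    return [(name, entity_type) for _, name, entity_type in sorted(found)]

entities = tag_entities("John took the New York-New Haven railroad.")
```

Plain substring matching is a known weakness of this sketch: it would also find "John" inside "Johnson", so token boundaries from the tokeniser should be respected in anything beyond a demonstration.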

19
Syntactic Analysis
  • Problem: given a sentence and a grammar/lexicon,
    discover the tree structure assigned to the sentence.
  • XIP Parser Demo
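
As an illustration of the parsing problem, the sketch below uses the CYK chart algorithm with a toy grammar in Chomsky normal form. The grammar, the lexicon and the function `parse` are invented for this example and are unrelated to the XIP parser; the point is only to show a tree being discovered from a sentence plus a grammar/lexicon.

```python
# Toy grammar in Chomsky normal form (invented for this example):
# (left child, right child) -> parent category.
BINARY = {("NP", "VP"): "S", ("V", "NP"): "VP"}

# Toy lexicon: word -> possible categories.
LEXICAL = {
    "John": ("NP",),
    "leaves": ("V", "NP"),
    "presents": ("NP", "V"),
}

def parse(words):
    """CYK parsing: return all trees of category S spanning `words`."""
    n = len(words)
    # chart[i][j] maps a category to the trees covering words[i:j].
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        for category in LEXICAL.get(word, ()):
            chart[i][i + 1].setdefault(category, []).append((category, word))
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):  # split point between the two children
                for left_cat, left_trees in chart[i][k].items():
                    for right_cat, right_trees in chart[k][j].items():
                        parent = BINARY.get((left_cat, right_cat))
                        if parent is None:
                            continue
                        for lt in left_trees:
                            for rt in right_trees:
                                chart[i][j].setdefault(parent, []).append(
                                    (parent, lt, rt)
                                )
    return chart[0][n].get("S", [])

trees = parse(["John", "leaves", "presents"])
```

Although "leaves" and "presents" are each ambiguous in the lexicon, this small grammar admits only one complete tree, so syntactic analysis also resolves the tagging ambiguity in this case.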