CSA3180: Natural Language Processing


Slides: 37
Provided by: michael307



CSA3180 Natural Language Processing
  • Statistics 1: Empirical Approach
  • Historical Background
  • Fundamental Issues
  • Tokenisation and Preprocessing

  • Slides based on Lectures by Mike Rosner (2003)
    and BNC2 POS Tagging Manual (Leech and Smith,
  • Foundations of Statistical Natural Language
    Processing, Manning and Schütze, MIT, 1999
  • Resources for statistical/empirical NLP
  • http://nlp.stanford.edu/links/statnlp.html
  • McEnery & Wilson notes on Corpus Linguistics
  • http://www.ling.lancs.ac.uk/monkey/ihe/linguistics

Historical Perspective
  • Pre-Chomsky linguistics (e.g. Boas 1940) was
    largely empirical
  • 1970s: rationalist approach to AI systems in
    restricted domains (e.g. Winograd 1972, Woods
    1977, Waltz 1978)
  • 1980s: hand-coded grammars and knowledge bases
    (e.g. Allen 1987)
  • Hand-coded systems need a great deal of
    domain-specific/expert knowledge engineering
  • Such systems are brittle, unscalable and
    inflexible
  • Second half of 1980s: focus shifted from
    rationalist methods to empirical/corpus-based
    methods
  • Development largely data-driven

Historical Perspective
  • Linguistics research: automatic induction of
    lexical and syntactic information from corpora
  • Speech recognition: Hidden Markov Model (HMM)
    based methods (IBM Yorktown Heights)
    outperformed previous knowledge-based approaches
  • Use of probabilistic finite state machines to
    model word pronunciations
  • Use of hill-climbing training algorithms to
    fit model parameters to actual speech data

Application Areas
  • Success of statistical methods in speech spread
    to other areas like POS tagging, spelling
    correction, and parsing
  • POS tagging: assigning appropriate syntactic
    class tags to words
  • Machine translation: training on bilingual
    corpora to extract word and contextual mappings
  • Parsing: based on treebanks (large databases of
    sentences annotated with syntactic parse trees),
    e.g. using probabilistic CFGs (PCFGs)
  • Word-sense disambiguation, attachment, anaphora
    resolution, discourse segmentation
  • Content-based document processing
  • Information Extraction: text → filled templates
  • Information Retrieval: query text → set of
    relevant documents

Empirical Approach Issues
  • Potential for solutions to old problems
  • Knowledge Acquisition
  • Coverage
  • Robustness
  • Domain Independence
  • Feasibility depends on data and computing
  • Pros
  • Emphasis on applications and evaluation
  • Scalability and applicability to real-life
  • Cons
  • Results always corpus dependent

Corpus Starting Point
  • A corpus (plural: corpora) is an organised body
    of language material that is used as the basis
    for empirical studies.
  • Important corpus characteristics
  • Statistical representativeness/balance
  • Medium: printed, electronic text, speech, video,
  • Language: monolingual/multilingual
  • Information content: plain text vs. tagged text
  • Structure: trees vs. sentences
  • Size
  • Standards
  • Quality

Corpora Examples
  • Project Gutenberg: collection of public domain
    texts
  • http://www.gutenberg.org
  • Brown Corpus: tagged corpus of around 1 million
    words put together at Brown University in the
    1960s and 70s. Balanced corpus of American English.
  • British National Corpus: a balanced corpus of
    British English containing over 100 million words
    with morphosyntactic annotation.
  • http://www.natcorp.ox.ac.uk
  • Penn Treebank
  • WordNet
  • Canadian Hansards
  • LDC GigaWord

Tagset Example
  • Here are some example POS tags from the BNC
    (CLAWS4 BNC Basic Tagset/C5 Tagset)
  • AJ0: Adjective (general or positive) (e.g. good,
    old, ...)
  • AJC: Comparative adjective (e.g. better, older)
  • AJS: Superlative adjective (e.g. best, oldest)
  • AT0: Article (e.g. the, a, an, no)
  • AV0: General adverb, i.e. an adverb not
    subclassified as AVP or AVQ (see below) (e.g.
    often, well, longer (adv.), furthest)
  • AVP: Adverb particle (e.g. up, off, out)

Tagset Examples
  • AVQ: Wh-adverb (e.g. when, where, how, why,
    wherever)
  • CJC: Coordinating conjunction (e.g. and, or, but)
  • CJS: Subordinating conjunction (e.g. although,
    when)
  • CJT: The subordinating conjunction that
  • CRD: Cardinal number (e.g. one, 3, fifty-five,
    3609)
  • DPS: Possessive determiner-pronoun (e.g. your,
    their, ...)

Tagset Examples
  • DT0: General determiner-pronoun, i.e. a
    determiner-pronoun which is not a DTQ or an AT0
  • DTQ: Wh-determiner-pronoun (e.g. which, what,
    whose, ...)
  • EX0: Existential there, i.e. there occurring in
    the "there is ..." or "there are ..." construction
  • ITJ: Interjection or other isolate (e.g. oh, yes,
    mhm, ...)
  • NN0: Common noun, neutral for number (e.g.
    aircraft, data, committee)

Tagset Examples
  • NN1: Singular common noun (e.g. pencil, goose,
    time, ...)
  • NN2: Plural common noun (e.g. pencils, geese,
    times, ...)
  • NP0: Proper noun (e.g. London, Michael, Mars, IBM)
  • ORD: Ordinal numeral (e.g. first, sixth, 77th,
    last)
  • PNI: Indefinite pronoun (e.g. none, everything,
    one as pronoun, nobody)
  • PNP: Personal pronoun (e.g. I, you, them, ours)

Tagset Examples
  • PNQ: Wh-pronoun (e.g. who, whoever, whom)
  • PNX: Reflexive pronoun (e.g. myself, yourself,
    itself, ...)
  • POS: The possessive or genitive marker 's or '
  • PRF: The preposition of
  • PRP: Preposition (except for of) (e.g. about, at,
    in, on, on behalf of, with)
  • PUL: Punctuation: left bracket, i.e. ( or [

Tagset Examples
  • PUN: Punctuation: general separating mark, i.e.
    . , ! , : ; - or ?
  • PUQ: Punctuation: quotation mark, i.e. ' or "
  • PUR: Punctuation: right bracket, i.e. ) or ]
  • TO0: Infinitive marker to
  • UNC: Unclassified items which are not
    appropriately considered as items of the English
    lexicon

Tagset Examples
  • VBB: The present tense forms of the verb BE,
    except for is, 's: i.e. am, are, 'm, 're and be
    (subjunctive or imperative)
  • VBD: The past tense forms of the verb BE: was
    and were
  • VBG: The -ing form of the verb BE: being
  • VBI: The infinitive form of the verb BE: be
  • VBN: The past participle form of the verb BE:
    been
  • VBZ: The -s form of the verb BE: is, 's

Tagset Examples
  • VDB: The finite base form of the verb DO: do
  • VDD: The past tense form of the verb DO: did
  • VDG: The -ing form of the verb DO: doing
  • VDI: The infinitive form of the verb DO: do
  • VDN: The past participle form of the verb DO:
    done
  • VDZ: The -s form of the verb DO: does, 's

Tagset Examples
  • VHB: The finite base form of the verb HAVE:
    have, 've
  • VHD: The past tense form of the verb HAVE:
    had, 'd
  • VHG: The -ing form of the verb HAVE: having
  • VHI: The infinitive form of the verb HAVE: have
  • VHN: The past participle form of the verb HAVE:
    had
  • VHZ: The -s form of the verb HAVE: has, 's

Tagset Examples
  • VM0: Modal auxiliary verb (e.g. will, would, can,
    could, 'll, 'd)
  • VVB: The finite base form of lexical verbs (e.g.
    forget, send, live, return), including the
    imperative and present subjunctive
  • VVD: The past tense form of lexical verbs (e.g.
    forgot, sent, lived, returned)
  • VVG: The -ing form of lexical verbs (e.g.
    forgetting, sending, living, returning)
  • VVI: The infinitive form of lexical verbs (e.g.
    forget, send, live, return)
  • VVN: The past participle form of lexical verbs
    (e.g. forgotten, sent, lived, returned)

Tagset Examples
  • VVZ: The -s form of lexical verbs (e.g. forgets,
    sends, lives, returns)
  • XX0: The negative particle not or n't
  • ZZ0: Alphabetical symbols (e.g. A, a, B, b, c, d)

Tagging Algorithms
  • Manual Tagging
  • Automatic Tagging
  • Stochastic: most probable sequence of categories
  • Rule-based: e.g. if the preceding word is a DT0
    (determiner) then the next tag is probably NN0,
    NN1 or NN2 (nouns)
  • Transformation-based: trainable, machine-learning
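
The rule-based idea above can be sketched as a single disambiguation step. This is a minimal illustration, not the CLAWS4 tagger: the toy lexicon, its candidate tag lists, and the function name `choose_tags` are all assumptions made for the example.

```python
# Toy lexicon: each word maps to candidate C5 tags (invented for illustration).
LEXICON = {
    "the": ["AT0"],
    "this": ["DT0"],
    "run": ["VVB", "NN1"],
    "runs": ["VVZ", "NN2"],
    "dog": ["NN1"],
    "dogs": ["NN2", "VVZ"],
}

NOUN_TAGS = {"NN0", "NN1", "NN2"}
DETERMINER_TAGS = {"AT0", "DT0"}


def choose_tags(words):
    """Pick one tag per word using a single hand-written contextual rule."""
    tags = []
    for word in words:
        candidates = LEXICON.get(word.lower(), ["UNC"])
        choice = candidates[0]  # default: first listed candidate
        # Rule: after a determiner, prefer a noun reading if one is available.
        if tags and tags[-1] in DETERMINER_TAGS:
            for tag in candidates:
                if tag in NOUN_TAGS:
                    choice = tag
                    break
        tags.append(choice)
    return tags


print(choose_tags(["the", "run"]))   # ['AT0', 'NN1']
print(choose_tags(["dogs", "run"]))  # ['NN2', 'VVB']
```

A stochastic tagger would replace the hand-written rule with probabilities estimated from a tagged corpus.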

Low Level Processing
  • Pre-processing
  • Filtering headers, whitespace, etc.
  • Reformatting and creation of appropriate
  • Data Gathering/Formatting/Transformation/Input
  • Tokenisation
  • Normalisation
  • Initial Tag Assignment
  • Tag Selection/Disambiguation
  • Post-processing

  • Divide input text into units called tokens; these
    can be either individual word tokens or orthographic
  • Tokens are usually of different types: words,
    numbers, punctuation
  • What is a word?
  • "a string of contiguous alphanumeric
    characters with space on either side; may
    include hyphens and apostrophes but no other
    punctuation marks"
  • (Kucera and Francis, 1967)
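
The Kucera and Francis definition can be read almost literally as a filter over whitespace-delimited chunks. The sketch below is one possible reading, restricted to ASCII letters and digits for simplicity; the names `WORD` and `graphic_words` are illustrative.

```python
import re

# Alphanumerics plus hyphens and apostrophes; no other punctuation allowed.
WORD = re.compile(r"[A-Za-z0-9'-]+")


def graphic_words(text):
    """Return whitespace-delimited chunks that satisfy the definition."""
    return [
        chunk
        for chunk in text.split()
        if WORD.fullmatch(chunk) and any(c.isalnum() for c in chunk)
    ]


print(graphic_words("The well-known dog can't bark, honestly."))
# ['The', 'well-known', 'dog', "can't"]
```

Note that "bark," and "honestly." are rejected because punctuation is attached to them, which is why punctuation has to be detached before a definition like this is usable.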

  • Token segments usually demarcated by white space
    or sentence boundaries (i.e. final sentence
    punctuation followed by initial capital letter of
    next sentence)
  • Not straightforward due to ambiguity of
    punctuation marks and of capital letters!
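
The boundary heuristic described above (final punctuation followed by an initial capital) can be sketched with one regular expression. The names `BOUNDARY` and `split_sentences` are illustrative, and the example is chosen to show where the heuristic goes wrong.

```python
import re

# Sentence-final punctuation, then whitespace, then a capital letter.
BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(text):
    """Split text wherever the naive boundary pattern matches."""
    return BOUNDARY.split(text.strip())


print(split_sentences("It rained. We stayed in. Dr. Smith called."))
# ['It rained.', 'We stayed in.', 'Dr.', 'Smith called.']
```

The abbreviation "Dr." is wrongly treated as a sentence end, which illustrates the ambiguity of periods and capital letters.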

Tokenisation Problems
  • Words may contain non-alphanumeric characters
  • 27.40
  • B.Sc.IT(Hons.)
  • cya l8r :-)
  • www.maltalinks.com
  • Presence of spaces around words does not
    necessarily indicate a unit break, e.g. Coca Cola
  • Items of particular semantic types that use
    spaces, e.g. phone numbers
  • 1 202-456-1414

Tokenisation Problems
  • Some languages use spaces very sparingly (like
    agglomerative languages such as German or
  • Geschwindigkeitsbegrenzung (speed limit)
  • Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz
    (beef labelling law)
  • Rindfleisch: beef (lit. beef meat)
  • etikettierungs: labelling
  • überwachungs: supervision (lit. over-watch)
  • aufgaben: tasks
  • übertragungs: transfer (delegation)
  • gesetz: law

Tokenisation Problems
  • Some languages do not use spaces at all! (like
    Chinese, Japanese, Thai)
  • Word segmentation for these languages can
    approach that of sentence segmentation in other
    languages
  • Probabilistic word segmentation gives quite good
    results
Tokenisation Problems
  • Specialised formats (like phone numbers, URLs)
    take us from tokenisation towards Information
    Extraction
  • Hand-crafted rules and regular expressions can be
    used to handle some common cases
  • But these are brittle and inflexible; automated
    learning methods are preferable
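
A hand-crafted tokeniser for a few of the common cases mentioned above might look like the following. It is a sketch only: the patterns cover just the formats shown in these slides, and the names `TOKEN` and `tokenise` are illustrative.

```python
import re

# Alternatives are tried left to right, so specialised formats come first.
TOKEN = re.compile(
    r"""
    (?:https?://|www\.)[\w./-]*\w    # URLs and bare web addresses
  | \d{3}-\d{3}-\d{4}                # phone numbers such as 202-456-1414
  | \d+(?:[.,]\d+)*(?!\w)            # numbers with internal separators
  | \w+(?:['-]\w+)*                  # words with internal hyphens or apostrophes
  | [^\w\s]                          # any other single non-space character
    """,
    re.VERBOSE,
)


def tokenise(text):
    """Return the list of tokens matched by the hand-written patterns."""
    return TOKEN.findall(text)


print(tokenise("Call 202-456-1414 or see www.maltalinks.com, it costs 27.40."))
# ['Call', '202-456-1414', 'or', 'see', 'www.maltalinks.com', ',', 'it', 'costs', '27.40', '.']
```

Every new format needs another hand-written alternative, which is the brittleness the slide refers to.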

  • Detaching spaces, semi-colons, commas, etc. from
    words is quite easy
  • Periods and apostrophes present special problems
  • Periods
  • End of sentence (.)
  • Abbreviations (e.g., etc., B.Sc.)
  • Numbers and date formats

  • Apostrophes
  • Contractions
  • (won't, they're, can't, it's)
  • Merged forms
  • (dunno, aintcha)
  • Trailing enclitics
  • Solution is often to have lookup tables for
    common (and not so common) forms

Apostrophe BNC2 Solution
  • Built-in Knowledge

Orthographic form   Broken down into    Component tags
'd've               'd 've              VM0 VHI
'tis                't is               PNP VBZ
'twas               't was              PNP VBD
'twere              't were             PNP VBD
'twould             't would            PNP VM0
I'd've              I 'd 've            PNP VM0 VHI
ain't               ai n't              UNC XX0
aint                ai nt               UNC XX0
aintcha             ai nt cha           UNC XX0 PNP
an'all              an' all / an'all    CJC DT0 / AV0
arent               are nt              VBB XX0
  • Trailing Enclitics

Enclitic form   Available tags
'd              VM0 / VHD
'm              VBB
's              VBZ / VHZ / VDZ / POS
'll             VM0
n't             XX0
're             VBB
've             VHB
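
The lookup-table approach can be sketched as two dictionaries: one for whole forms with a fixed split, and one for trailing enclitics with their candidate tags. The entries are copied from the two tables above (only a few whole forms are included); the function name `split_token` and the use of `None` for "tags not yet assigned" are assumptions of this example.

```python
# Whole forms with a fixed split: (piece, tag) pairs, keyed in lowercase.
SPLITS = {
    "'tis": [("'t", "PNP"), ("is", "VBZ")],
    "'twas": [("'t", "PNP"), ("was", "VBD")],
    "ain't": [("ai", "UNC"), ("n't", "XX0")],
    "aintcha": [("ai", "UNC"), ("nt", "XX0"), ("cha", "PNP")],
}

# Trailing enclitics and the tags available to each.
ENCLITICS = {
    "'d": ["VM0", "VHD"],
    "'m": ["VBB"],
    "'s": ["VBZ", "VHZ", "VDZ", "POS"],
    "'ll": ["VM0"],
    "n't": ["XX0"],
    "'re": ["VBB"],
    "'ve": ["VHB"],
}


def split_token(token):
    """Return (piece, candidate_tags) pairs; tags are None when undecided."""
    key = token.lower()
    if key in SPLITS:
        return [(piece, [tag]) for piece, tag in SPLITS[key]]
    for enclitic, tags in ENCLITICS.items():
        if key.endswith(enclitic) and len(key) > len(enclitic):
            cut = len(token) - len(enclitic)
            return [(token[:cut], None), (token[cut:], tags)]
    return [(token, None)]


print(split_token("it's"))   # [('it', None), ("'s", ['VBZ', 'VHZ', 'VDZ', 'POS'])]
print(split_token("can't"))  # [('ca', None), ("n't", ['XX0'])]
```

The enclitic 's keeps all four candidate tags here; choosing among them is left to the later tag selection step.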
  • Hyphens are usually treated as word internal
  • Not always the case (e.g. il-ktieb in Maltese)
  • Hyphens can also be used as quotation marks

  • Two tokens containing same characters are often
    instances of the same type
  • The, THE, the
  • Mapping to the same case can reduce the amount
    of data to be stored (e.g. map all instances of
    The and THE to the)
  • Heuristics
  • Map first character of a sentence to lowercase
  • Map all words in titles to lowercase
  • Problems
  • Identification of sentence boundaries
  • Identification of proper names
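
The first heuristic can be sketched as follows, assuming the text has already been split into sentences and tokens. The function name `normalise_case` is illustrative.

```python
def normalise_case(sentences):
    """Lowercase the first character of the first token in each sentence."""
    normalised = []
    for tokens in sentences:
        tokens = list(tokens)
        if tokens and tokens[0]:
            tokens[0] = tokens[0][0].lower() + tokens[0][1:]
        normalised.append(tokens)
    return normalised


print(normalise_case([["The", "dog", "barked"],
                      ["Michael", "lives", "in", "London"]]))
# [['the', 'dog', 'barked'], ['michael', 'lives', 'in', 'London']]
```

The second sentence shows the proper-name problem: "Michael" is wrongly lowercased, while "London" survives only because it is not sentence-initial.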

Types vs. Tokens
  • How many words are there in this sentence?
  • The quick brown fox jumps over the lazy dog
  • 9 tokens
  • 8 types: the, quick, brown, fox, jumps, over,
    lazy, dog
  • Wordform types: every different/unique form
  • Lemmas: every root word/unique entry
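
The count above can be reproduced in a few lines, assuming whitespace tokenisation and case folding so that "The" and "the" count as one type. The function name `count_tokens_and_types` is illustrative.

```python
def count_tokens_and_types(text):
    """Return (number of tokens, number of case-folded wordform types)."""
    tokens = text.split()
    types = {token.lower() for token in tokens}
    return len(tokens), len(types)


print(count_tokens_and_types("The quick brown fox jumps over the lazy dog"))
# (9, 8)
```

Without case folding the same sentence would yield 9 types, since "The" and "the" would be counted separately.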

How many words in English?
  • Switchboard Corpus of spoken English: 2.4 million
    tokens, 20,000 wordform types
  • Shakespeare: 884,647 tokens, 29,066 wordform
    types
  • Gutenberg project and GigaWord sample from Morpho
    Challenge 2005: 24,447,034 tokens, 167,377 types
  • http://www.cis.hut.fi/morphochallenge2005/datasets
  • Type/token ratio
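
The type/token ratio for the three corpora above follows directly from the quoted counts. In this sketch the Switchboard token count is taken as exactly 2,400,000, which is an assumption since the slide gives a rounded figure; the names `type_token_ratio` and `CORPORA` are illustrative.

```python
def type_token_ratio(types, tokens):
    """Distinct wordform types divided by running tokens."""
    if tokens <= 0:
        raise ValueError("tokens must be positive")
    return types / tokens


# (types, tokens) as quoted above.
CORPORA = {
    "Shakespeare": (29_066, 884_647),
    "Switchboard": (20_000, 2_400_000),
    "Morpho Challenge 2005 sample": (167_377, 24_447_034),
}

for name, (types, tokens) in CORPORA.items():
    print(f"{name}: {type_token_ratio(types, tokens):.4f}")
# Shakespeare: 0.0329
# Switchboard: 0.0083
# Morpho Challenge 2005 sample: 0.0068
```

In these three cases the ratio falls as the corpus gets larger, so ratios are only comparable between samples of similar size.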

  • Are eat and eats different words?
  • Two different wordforms
  • Same lemma (same stem)
  • Stemming vs. morphological analysis (depends on
  • Porter stemmer
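
To make the eat/eats point concrete, here is a deliberately crude suffix stripper. It is not the Porter algorithm, which applies several ordered rule phases with conditions on the stem; the suffix list and the name `naive_stem` are assumptions made for this example.

```python
# Longer suffixes are tried first; this list is illustrative, not Porter's rule set.
SUFFIXES = ("ing", "es", "ed", "s")


def naive_stem(word):
    """Strip at most one common suffix, keeping a stem of at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


print(naive_stem("eat"), naive_stem("eats"), naive_stem("eating"))  # eat eat eat
print(naive_stem("lives"))  # liv
```

The output "liv" is not a dictionary word, which illustrates the difference between stemming, which only needs related forms to collapse to the same string, and morphological analysis, which aims at a real lemma.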