1
Ch1. Introduction
Foundations of Statistical Natural Language Processing
  • 2002.01.10

2
Contents
  • Rationalist and Empiricist Approaches to Language
  • Scientific Content
  • The Ambiguity of Language: Why NLP Is Difficult
  • Dirty Hands

3
Rationalist and Empiricist Approaches to Language (1)
  • Rationalist Approach
  • 1960-1985, Chomsky
  • Crucially distinguishes between
  • Linguistic Competence
  • Linguistic Performance
  • The key parts of language are innate.
  • Hardwired in the brain at birth as part of the
    human genetic inheritance.
  • Rule-based approach
  • Grammar rules already exist from the beginning.

4
Rationalist and Empiricist Approaches to Language (2)
  • Empiricist Approach
  • Organizes and generalizes linguistic knowledge
    from sensory input.
  • An empiricist approach to NLP
  • Specifies an appropriate general language model.
  • Induces the values of its parameters
  • By statistical, pattern-recognition, and machine
    learning methods.
  • Corpus-based approach
  • "You shall know a word by the company it
    keeps." (J. R. Firth)
  • Currently widespread.

5
Rationalist and Empiricist Approaches to Language (3)
  • Rationalist vs. Empiricist
  • Chomskyan (generative) linguists
  • Categorical principles
  • Make categorical judgments
  • Sentences either do or do not satisfy a rule.
  • Statistical NLP practitioners
  • Assign probabilities to linguistic events
  • Say which sentences are usual and unusual.

6
Scientific Content
  • Questions that linguistics should answer
  • Non-categorical phenomena in language
  • Language and cognition as probabilistic phenomena

7
Questions that linguistics should answer (1)
  • Two Basic Questions
  • What kinds of things do people say?
  • Covers all aspects of the structure of language
  • What do these things say/ask/request about the
    world?
  • Deals with
  • Semantics
  • Pragmatics
  • Discourse

8
Questions that linguistics should answer (2)
  • Grammaticality
  • Judged purely on whether a sentence is
    structurally well-formed,
  • Not according to whether it is the kind of thing
    that people would say or whether it is
    semantically anomalous.
  • e.g. "Colorless green ideas sleep furiously."
    (Chomsky)
  • Grammatical, although semantically strange.

9
Questions that linguistics should answer (3)
  • Conventionality
  • Simply the way in which people frequently express
    something,
  • Even though other ways are in principle possible.
  • Non-native speakers often say something
    ungrammatical
  • We can understand it,
  • But it would sound better expressed slightly
    differently.

10
Non-categorical phenomena in language (1)
  • Blending of parts of speech: near
  • Not blended
  • We will review that decision in the near future.
    (adjective)
  • He lives near the station. (preposition)
  • Blended
  • He has never been nearer the center of the
    financial establishment.
  • We live nearer the water than you thought.

11
Non-categorical phenomena in language (2)
  • Language change: kind of and sort of
  • Normally noun + preposition
  • What sort of animal made these tracks?
  • Degree modifiers (adverbs)
  • We are kind of hungry.
  • He sort of understood what was going on.
  • Before the 19th century: clearly a noun
  • A nette sent in to the see, and of alle kind of
    fishis gedrynge. (1382)
  • I knowe that sorte of men ryght well. (1560)
  • Since the 19th century: use as a degree modifier
    appears
  • I kind of love you, Sal - I vow. (1804)
  • It sort o' stirs one up to hear about old times.
    (1833)

12
Non-categorical phenomena in language (3)
  • Language change: kind of and sort of (cont.)
  • Between the 16th and 19th centuries, it grew to
    look syntactically more like a degree modifier.
  • Their finest and best, is a kind of
    course (coarse) red cloth. (c. 1600; noun +
    preposition)
  • But in such questions as the present, a hundred
    contradictory views may preserve a kind of
    imperfect analogy. (1743; degree modifier)
  • This frequency change seems to have driven a
    change in syntactic category.
  • e.g. modifying verb phrases
  • Details of gradual language change can only be
    made sense of by examining frequencies of use.

13
Language and cognition as probabilistic phenomena
  • Human cognition is probabilistic.
  • Language must therefore be probabilistic too
    since it is an integral part of cognition.
  • If language and cognition are best explained
    probabilistically,
  • Then probability theory must be a central part of
    explanatory theory of language.

14
The Ambiguity of Language: Why NLP Is Difficult (1)
  • e.g. Ambiguity in syntactic analysis (parsing)
  • Our company is training workers. (3 parses)
  • Parses (b) and (c) are semantically anomalous
    (a toy-grammar sketch follows below).

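The three parses can be reproduced with a toy grammar. Below is a minimal
Python sketch assuming the NLTK library; the grammar and its category names
(VPing, Ger, AdjP) are illustrative choices, not taken from the book:

    import nltk

    # Toy CFG in which "training" can be a participial verb, an
    # adjective-like modifier, or a gerund heading a noun phrase.
    grammar = nltk.CFG.fromstring("""
        S -> NP VP
        NP -> Det N | N | AdjP N | Ger NP
        VP -> Aux VPing | Cop NP
        VPing -> Ving NP
        Det -> 'our'
        N -> 'company' | 'workers'
        Aux -> 'is'
        Cop -> 'is'
        Ving -> 'training'
        AdjP -> 'training'
        Ger -> 'training'
    """)

    parser = nltk.ChartParser(grammar)
    for tree in parser.parse("our company is training workers".split()):
        print(tree)   # prints three distinct trees for the one string
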
15
The Ambiguity of Language: Why NLP Is Difficult (2)
  • As sentences get longer and grammars get more
    comprehensive,
  • Ambiguities lead to a terrible multiplication of
    parses.
  • List the sales of the products produced in 1973
    with the products produced in 1972. (455 parses)
  • There are many ambiguities in
  • Word sense, category, syntactic structure, and
    semantic scope.
  • The goal of maximizing coverage while minimizing
    ambiguity is
  • Fundamentally inconsistent with symbolic NLP
    systems.

16
The Ambiguity of Language: Why NLP Is Difficult (3)
  • Manual rule creation and hand-tuning
  • Time-consuming to build.
  • Do not scale up well.
  • Produce a knowledge acquisition bottleneck.
  • Perform poorly.
  • The majority of statistical models for disambiguation
  • Are robust.
  • Behave gracefully
  • In the presence of errors and new data.
  • Automatic learning
  • Reduces the human effort in producing NLP systems.

17
Dirty Hands
  • Lexical resources
  • Word counts
  • Zipf's laws
  • Collocations
  • Concordances

18
Lexical resources
Dirty Hands
  • Corpus
  • A collection of machine-readable texts (MRT)
  • Types
  • Raw corpus vs. tagged corpus
  • Balanced corpus vs. unbalanced corpus
  • Monolingual corpus vs. multilingual (e.g.
    bilingual) corpus
  • Famous Corpora
  • English: Brown, LOB (Lancaster-Oslo-Bergen),
    Susanne, BNC (British National Corpus), and Penn
    Treebank
  • Korean: national corpus projects
  • Bilingual: Canadian Hansards (English and French)
  • Other lexical resources
  • Dictionaries, thesauri, and also tools
  • WordNet

19
Word counts (1)
Dirty Hands
  • What are the most common words in the text?
  • Table 1.1: Common words in Tom Sawyer.
  • They have important grammatical roles, and are
    usually referred to as function words.

20
Word counts (2)
Dirty Hands
  • Tokens
  • Individual occurrences of something.
  • Types
  • The different things present.
  • Word tokens
  • Individual occurrences of words.
  • e.g. word tokens in Tom Sawyer: 71,370
  • Word types
  • The different words that appear in the text.
  • e.g. word types in Tom Sawyer: 8,108
    (a counting sketch follows below)

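A short Python sketch of these counts; "tom_sawyer.txt" is a placeholder
filename and the one-line regex tokenizer is an arbitrary choice, so exact
numbers will differ slightly from the book's:

    from collections import Counter
    import re

    # Read a plain-text copy of the novel (hypothetical local file).
    text = open("tom_sawyer.txt", encoding="utf-8").read().lower()

    # Crude tokenizer: maximal runs of letters/apostrophes are word tokens.
    tokens = re.findall(r"[a-z']+", text)
    counts = Counter(tokens)            # word type -> frequency

    print("word tokens:", len(tokens))  # on the order of 71,370
    print("word types: ", len(counts))  # on the order of 8,108
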
21
Word counts (3)
Dirty Hands
  • Ratio of tokens to types (type-token ratio)
  • The average frequency with which each type is used.
  • Table 1.2: Frequency of frequencies of word types
    in Tom Sawyer.
  • What makes frequency-based approaches to language
    hard is that almost all words are rare.

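Continuing the sketch above (same tokens and counts variables), the average
type frequency and a Table 1.2 style frequency-of-frequencies table fall out
directly:

    # Average frequency with which each type is used (tokens per type).
    print("tokens per type:", len(tokens) / len(counts))

    # Frequency of frequencies: how many word types occur exactly f times.
    freq_of_freq = Counter(counts.values())
    for f in sorted(freq_of_freq)[:5]:
        print(f"{freq_of_freq[f]:>5} word types occur {f} time(s)")
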
22
Zipf's Laws (1)
Dirty Hands
  • The famous law: Zipf's law
  • The frequency distribution of words:
  • f · r = k
  • f: the frequency of a word
  • r: its rank in the frequency list
  • k: a constant
  • Table 1.3: Empirical evaluation of Zipf's law on
    Tom Sawyer (see the sketch below).
  • Main upshot
  • For most words our data about their use will be
    exceedingly sparse.
  • Only for a few words will we have lots of
    examples.

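An empirical check in the spirit of Table 1.3, reusing the counts from the
earlier sketch: walk down the rank-frequency list and see whether f · r stays
roughly constant (the sampled ranks are an arbitrary choice):

    # ranked[i] is the (word, frequency) pair at rank i + 1.
    ranked = counts.most_common()

    for r in (1, 10, 100, 1000):
        word, f = ranked[r - 1]
        print(f"rank {r:>5}  {word:<12} f = {f:>6}  f*r = {f * r}")
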
23
Zipf's Laws (2)
Dirty Hands
  • Mandelbrot's formula
  • f = P · (r + ρ)^(-B)
  • P, ρ, and B: constant parameters
  • If B = 1 and ρ = 0, it simplifies to Zipf's law.

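A minimal sketch of the formula, using the parameter values quoted on the
next slide (the sampled ranks are an arbitrary choice):

    # Mandelbrot's rank-frequency formula: f = P * (r + rho) ** -B.
    P, B, rho = 10 ** 5.4, 1.15, 100.0

    def mandelbrot_freq(r: int) -> float:
        return P * (r + rho) ** -B

    for r in (1, 10, 100, 1000):
        print(f"rank {r:>5}  predicted f = {mandelbrot_freq(r):.1f}")

    # With B = 1 and rho = 0 this reduces to f = P / r, i.e. Zipf's law.
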
24
Zipf's Laws (3)
Dirty Hands
  • Figure 1.1: Zipf's law with k = 100,000.
  • Figure 1.2: Mandelbrot's formula with
    P = 10^5.4, B = 1.15, ρ = 100.

25
Zipf's Laws (4)
Dirty Hands
  • Other laws
  • Correlation between the number of meanings and
    frequency:
  • m ∝ f^(1/2)
  • m: the number of meanings of a word
  • Tendency of content words to clump together:
  • F ∝ I^(-p)
  • F: frequency of intervals, I: interval size
  • p: a constant between 1 and 1.3

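A rough, assumption-laden check of the meanings law, using WordNet sense
counts (via NLTK, assumed installed) as a stand-in for "number of meanings"
and the counts from the earlier sketch for frequency:

    from nltk.corpus import wordnet as wn

    # If m is proportional to sqrt(f), the ratio m / sqrt(f) should stay
    # within a fairly narrow band for content words of varied frequency.
    for word in ("run", "house", "river", "glass"):
        f = counts.get(word, 1)      # corpus frequency from the earlier sketch
        m = len(wn.synsets(word))    # WordNet senses as a proxy for meanings
        print(f"{word:<8} f = {f:>5}  m = {m:>3}  m/sqrt(f) = {m / f ** 0.5:.2f}")
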
26
Collocation (1)
Dirty Hands
  • Any turn of phrase or accepted usage that people
    repeat.
  • Compounds: disk drive
  • Phrasal verbs: make up
  • Stock phrases (idioms): bacon and eggs
  • Normalizing and filtering are needed to find them.
  • Important in machine translation (MT) and
    information retrieval (IR).

27
Collocation (2)
  • Table 1.4:
  • Commonest bigram collocations in the NYT.
  • Table 1.5:
  • Frequent bigrams after POS filtering
    (see the sketch below).

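A sketch of both tables' computations over the earlier token list: raw bigram
counts are dominated by function-word pairs, and a Justeson-and-Katz style
part-of-speech filter (here a whitelist of adjective-noun and noun-noun tag
pairs, with NLTK's tagger as an assumed dependency) surfaces phrase-like
bigrams instead:

    import nltk  # also needs the 'averaged_perceptron_tagger' data

    # Raw bigram counts: the top of this list is all function words.
    bigrams = Counter(zip(tokens, tokens[1:]))
    print(bigrams.most_common(5))        # e.g. ('of', 'the'), ('in', 'the'), ...

    # POS filter: keep only adjective-noun and noun-noun bigrams.
    tagged = nltk.pos_tag(tokens)
    keep = {("JJ", "NN"), ("NN", "NN")}
    filtered = Counter(
        (w1, w2)
        for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
        if (t1, t2) in keep
    )
    print(filtered.most_common(5))
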
28
Concordance (1)
Dirty Hands
  • Collecting information about patterns of
    occurrence of words or phrases.
  • Can be useful
  • Not only for purposes such as dictionaries for
    learners of foreign languages,
  • But also for use in guiding statistical parsers.

29
Concordance (2)
  • KWIC (Key Word In Context) concordancing program
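
A minimal KWIC sketch over the earlier token list (the window width and
output format are arbitrary choices):

    def kwic(tokens, keyword, width=4):
        """Print every occurrence of keyword with width tokens of context."""
        for i, w in enumerate(tokens):
            if w == keyword:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                print(f"{left:>30}  [{w}]  {right}")

    kwic(tokens, "kind")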