Foundation of Statistical Natural Language Processing: 1. Introduction
1
Foundation of Statistical Natural Language Processing
1. Introduction
  • ???
  • 1999.12.30

2
Abstract(1)
  • The AIM of linguistic science
  • Understanding how humans acquire, produce, and
    understand language
  • Understanding the relationship between linguistic
    utterances and the world
  • Understanding the linguistic structures
  • Rationalist approach
  • Grammaticality
  • Empiricist approach
  • Conventionality

3
Abstract(2)
  • WHY statistical and HOW
  • WHY
  • Better at automatic learning (knowledge
    induction)
  • Better at disambiguation
  • HOW
  • Counting words
  • Collocations
  • Concordances

4
Contents
  • Rationalist and Empiricist Approaches to Language
  • Scientific Content
  • The Ambiguity of Language
  • Why NLP Is Difficult
  • Dirty Hands

5
Rationalist and Empiricist Approaches to
Language(1)
  • Corpus
  • A body of texts
  • A surrogate for situating language in a
    real-world context

6
Rationalist and Empiricist Approaches to
Language(2)
7
Scientific Content(1)
  • Questions that linguistics should answer
  • Two basic questions
  • What kinds of things do people say?
  • All aspects of the structure of language
  • What do these things say/ask/request about the
    world?
  • How to connect utterances with the world

8
Scientific Content(2)
  • Grammaticality and Conventionality
  • Grammaticality
  • Whether a sentence is structurally well-formed
  • Categorical perception
  • Rationalism
  • Conventionality
  • Whether an expression is one that people commonly
    say or do
  • Frequency of use
  • Empiricism

9
Scientific Content(3)
  • Non-categorical phenomena in language
  • Blending of parts of speech
  • We will review that decision in the near future.
    (adjective)
  • He lives near the station. (preposition)
  • We live nearer the water than you thought.
  • Blending of adjective and preposition
  • Grammatically strange, but commonly used
  • Language change
  • What sort of animal made these tracks?
  • 'sort' (noun) followed by 'of' (preposition)
  • He sort of understood what was going on.
  • somewhat, slightly
  • Language change is generally gradual
  • Statistical observation is required to model it

10
Scientific Content(4)
  • Language and cognition as probabilistic phenomena
  • Human cognition is probabilistic!
  • Probabilistic approach
  • Handles uncertainty and incomplete information
    well
  • Even well-formed sentences may be unattested in a
    corpus
  • A major concern of Statistical NLP is
  • deriving good probability estimates for unseen
    events (see the smoothing sketch after this slide)
  • The 'use' theory of meaning
  • The meaning of a word is defined by the
    circumstances of its use
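
A minimal sketch of one simple way to give unseen events a non-zero
probability estimate: add-one (Laplace) smoothing over unigram counts.
The toy corpus, the vocabulary size, and the function name are
illustrative assumptions, not material from the original slides.

    from collections import Counter

    def laplace_unigram(tokens, vocab_size):
        """Add-one (Laplace) smoothed unigram model: unseen words receive
        a small non-zero probability instead of zero."""
        counts = Counter(tokens)
        total = len(tokens)

        def prob(word):
            # (count + 1) / (N + V): every word, seen or unseen, gains one pseudo-count
            return (counts[word] + 1) / (total + vocab_size)

        return prob

    tokens = "the cat sat on the mat".split()          # toy corpus
    p = laplace_unigram(tokens, vocab_size=10_000)     # assumed vocabulary size
    print(p("the"))   # seen word: relatively high estimate
    print(p("dog"))   # unseen word: small but non-zero estimate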

11
The Ambiguity of Language: Why NLP Is Difficult
  • An example with three parses
  • Our company is training workers.
  • S(NP(Our company) VP(Aux(is) VP(V(training)
    NP(workers))))
  • S(NP(Our company) VP(V(is) NP(VP(V(training)
    NP(workers)))))
  • S(NP(Our company) VP(V(is) NP(AdjP(training)
    N(workers))))
  • This ambiguity (multiple possible parses) is what
    makes NLP difficult!
  • Statistical NLP approach
  • Using corpora
  • Automatically learning lexical and structural
    preferences from corpora
  • Such preferences help resolve ambiguity (see the
    sketch after this slide)
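
A minimal sketch of how corpus-derived preferences could choose among
the parses above: count how the word "training" is tagged when it
follows a form of "be" in a tagged corpus, then prefer the more
frequent analysis. The tiny hand-made tagged corpus is an illustrative
assumption; a real system would use a large resource such as the Brown
corpus or the Penn Treebank.

    from collections import Counter

    # Toy tagged corpus of (word, part-of-speech) pairs
    tagged = [("is", "VBZ"), ("training", "VBG"), ("workers", "NNS"),
              ("is", "VBZ"), ("training", "NN"), ("data", "NN"),
              ("is", "VBZ"), ("training", "VBG"), ("staff", "NN")]

    # How often is 'training' a progressive verb (VBG) vs. a noun (NN)
    # immediately after a form of 'be'?
    prefs = Counter()
    for (w1, _), (w2, t2) in zip(tagged, tagged[1:]):
        if w1 in {"is", "are", "was", "were"} and w2 == "training":
            prefs[t2] += 1

    print(prefs)                       # e.g. Counter({'VBG': 2, 'NN': 1})
    print(prefs.most_common(1)[0][0])  # preferred reading: VBG (progressive verb)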

12
Dirty Hands(1)
  • Lexical resources
  • Brown corpus
  • Tagged corpus of about a million words that was
    put together at Brown University
  • Balanced corpus
  • Representative sample of American English at the
    time
  • Lancaster-Oslo-Bergen (LOB) corpus
  • A British English replication of the Brown corpus
  • Susanne corpus
  • A 130,000 word subset of the Brown corpus

13
Dirty Hands(2)
  • Penn Treebank
  • A larger corpus of syntactically annotated
    sentences
  • Canadian Hansards (bilingual)
  • The proceedings of the Canadian parliament
  • WordNet
  • An electronic dictionary of English
  • Bilingual corpus
  • A corpus that contains parallel texts in two or
    more languages

14
Dirty Hands(3)
  • Zipf's law: a word's frequency f is roughly
    inversely proportional to its rank r (the product
    f · r is approximately constant)
  • Empirical evaluation of Zipf's law on Tom Sawyer
    (see the sketch after this slide)
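
A minimal sketch of how such an evaluation could be run: count word
frequencies in a plain-text copy of the novel, rank them, and print
rank times frequency, which Zipf's law predicts to be roughly constant.
The file name is an illustrative assumption; any sufficiently large
text will do.

    import re
    from collections import Counter

    # Assumed local plain-text copy of Tom Sawyer
    text = open("tom_sawyer.txt", encoding="utf-8").read().lower()
    words = re.findall(r"[a-z]+", text)

    ranked = Counter(words).most_common()   # [(word, freq), ...], highest first

    for rank, (word, freq) in enumerate(ranked[:10], start=1):
        # Zipf's law predicts rank * freq stays roughly constant
        print(rank, word, freq, rank * freq)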

15
Dirty Hands(4)
  • Mandelbrot's formula: f = P(r + ρ)^(-B)
  • (P, B and ρ are parameters of the text)
  • Evaluation of the two laws above on the Brown
    corpus (see the comparison sketch after this slide)
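
A small follow-up sketch comparing what the two laws predict at a given
rank; the values chosen for k, P, B, and rho are illustrative guesses,
not parameters fitted to the Brown corpus.

    def zipf(rank, k=100_000):
        # Zipf's law: frequency proportional to 1/rank, with assumed constant k
        return k / rank

    def mandelbrot(rank, P=1_000_000, B=1.15, rho=2.7):
        # Mandelbrot's formula: f = P * (rank + rho) ** (-B)
        return P * (rank + rho) ** (-B)

    for rank in (1, 10, 100, 1_000, 10_000):
        print(rank, round(zipf(rank)), round(mandelbrot(rank)))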

16
Dirty Hands(5)
  • Other laws
  • Number of meanings m of a word grows with its
    frequency (roughly m ∝ √f)
  • Tendency of content words to clump
  • Frequency F of intervals of size I between
    occurrences of a word: F ∝ I^(-p)
  • (I interval size, p a parameter that varies from
    word to word; see the interval sketch after this
    slide)
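
A minimal sketch of measuring the clumping tendency: record the gaps
between successive occurrences of a word and tally how often each
interval size occurs. The token stream is an illustrative placeholder;
the law above predicts that small intervals dominate, with frequency
falling off as a power of the interval size.

    from collections import Counter

    tokens = "a b x a a b a x a".split()   # placeholder token stream
    word = "a"

    # Positions of the word, then intervals between successive occurrences
    positions = [i for i, t in enumerate(tokens) if t == word]
    intervals = [j - i for i, j in zip(positions, positions[1:])]

    print(Counter(intervals))   # frequency of each interval size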

17
Dirty Hands(6)
  • Significance of power laws
  • The probability of a word of length n being
    generated follows from a simple random-typing
    model (see the derivation sketch after this slide)
  • The key insights
  • There are 26 times more words of length n+1 than
    of length n
  • There is a constant ratio by which words of length
    n are more frequent than words of length n+1
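
To spell out why this yields a power law, here is a brief derivation
sketch (in LaTeX notation). It assumes the usual toy model of 26
letters plus a space bar, every keystroke equally likely, and ignores
constant factors; it is a reconstruction of the standard argument, not
a formula copied from the slides.

    % A particular word of length n (n letters followed by a space):
    P(\text{word}) = \left(\tfrac{1}{27}\right)^{n} \cdot \tfrac{1}{27} = 27^{-(n+1)}

    % Number of distinct words of length n: 26^{n},
    % so a length-n word has rank r \approx 26^{n}
    % and frequency f \approx 27^{-(n+1)}.

    % Eliminating n (up to constant factors):
    f \;\propto\; r^{-\log 27 / \log 26} \;\approx\; r^{-1.01}

    % i.e. frequency falls off as a power of rank, which is Zipf-like behaviour.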

18
Dirty Hands(7)
  • Collocations
  • Any turn of phrase or accepted usage
  • Collocation includes
  • Compounds (disk drive)
  • Phrasal verbs (make up)
  • And other stock phrases (bacon and eggs)
  • Collocation is important for
  • Machine translation
  • A word may be translated differently according to
    the collocation it occurs in
  • Information retrieval systems
  • May want to index only interesting phrases (see
    the sketch after this slide)
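
A minimal sketch of finding candidate collocations by raw bigram
frequency, with a small stopword filter so that function-word pairs do
not dominate. The token list and stopword set are illustrative
assumptions; plain frequency is a crude criterion, but it matches the
counting theme of this chapter.

    from collections import Counter

    tokens = ("new york stock exchange said the new york "
              "disk drive maker said the disk drive").split()
    stopwords = {"the", "a", "an", "of", "and", "said"}

    # Count adjacent word pairs, skipping any pair containing a stopword
    bigrams = Counter(
        (w1, w2) for w1, w2 in zip(tokens, tokens[1:])
        if w1 not in stopwords and w2 not in stopwords
    )

    # The most frequent pairs are candidate collocations, e.g. compounds
    # such as 'disk drive' or 'new york'
    print(bigrams.most_common(3))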

19
Dirty Hands(8)
  • Concordances
  • Key Word In Context (KWIC)
  • See figure 1.3 on p. 32 (a minimal KWIC sketch
    follows this slide)
  • Syntactic frames
  • See figure 1.4 on p. 33
  • Useful for
  • Dictionaries for learners of foreign languages
  • Guiding statistical parsers
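
A minimal sketch of a KWIC (Key Word In Context) concordance along the
lines of figure 1.3: print each occurrence of a target word with a
fixed number of context words on either side. The sample text and
window size are illustrative assumptions.

    def kwic(tokens, keyword, window=3):
        """Print every occurrence of keyword with `window` words of context."""
        for i, tok in enumerate(tokens):
            if tok.lower() == keyword:
                left = " ".join(tokens[max(0, i - window):i])
                right = " ".join(tokens[i + 1:i + 1 + window])
                print(f"{left:>25}  [{tok}]  {right}")

    text = ("she showed off her new dress and he showed "
            "the visitors around the new building")
    kwic(text.split(), "showed")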