Title: Ch1. Introduction
Ch1. Introduction
Foundations of Statistical Natural Language Processing
- 2002.01.10
Contents
- Rationalist and Empiricist Approaches to Language
- Scientific Content
- The Ambiguity of Language: Why NLP Is Difficult
- Dirty Hands
Rationalist and Empiricist Approaches to Language (1)
- Rationalist Approach
- 1960-1985, Chomsky
- Distinguishes crucially between
- Linguistic Competence
- Linguistic Performance
- The key parts of language are innate.
- Hardwired in the brain at birth as part of the human genetic inheritance.
- Rule-based approach
- Grammar rules already exist from the beginning.
Rationalist and Empiricist Approaches to Language (2)
- Empiricist Approach
- Organizing and generalizing linguistic knowledge from the sensory input.
- An empiricist approach to NLP
- Specifying an appropriate general language model.
- Inducing the values of the model's parameters
- By statistical, pattern recognition, and machine learning methods.
- Corpus-based approach
- "You shall know a word by the company it keeps." (J. R. Firth)
- Currently widespread.
Rationalist and Empiricist Approaches to Language (3)
- Rationalist vs. Empiricist
- Chomskyan (generative) linguists
- Categorical principles
- Make categorical judgments
- Sentences either do or do not satisfy the rule.
- Statistical NLP practitioners
- Assign probabilities to linguistic events
- Say which sentences are usual and which are unusual.
Scientific Content
- Questions that linguistics should answer
- Non-categorical phenomena in language
- Language and cognition as probabilistic phenomena
Questions that linguistics should answer (1)
- Two Basic Questions
- What kinds of things do people say?
- Covers all aspects of the structure of language
- What do these things say/ask/request about the world?
- Deals with
- Semantics
- Pragmatics
- Discourse
Questions that linguistics should answer (2)
- Grammaticality
- Judged purely on whether a sentence is structurally well-formed,
- Not according to whether it is the kind of thing that people would say or whether it is semantically anomalous.
- e.g. "Colorless green ideas sleep furiously." (Chomsky)
- Grammatical, although semantically strange.
Questions that linguistics should answer (3)
- Conventionality
- Simply a way in which people frequently express something,
- Even though other ways are in principle possible.
- Non-native speakers often say something ungrammatical.
- We can understand it,
- But it would sound better expressed slightly differently.
Non-categorical phenomena in language (1)
- Blending of parts of speech: "near"
- Not blended
- We will review that decision in the near future. (adjective)
- He lives near the station. (preposition)
- Blended
- He has never been nearer the center of the financial establishment.
- We live nearer the water than you thought.
Non-categorical phenomena in language (2)
- Language change: "kind of" and "sort of"
- Normal noun + preposition
- What sort of animal made these tracks?
- Degree modifiers (adverbs)
- We are kind of hungry.
- He sort of understood what was going on.
- Before the 19th century: clearly a noun
- "A nette sent in to the see, and of alle kind of fishis gedrynge." (1382)
- "I knowe that sorte of men ryght well." (1560)
- Since the 19th century: use as a degree modifier appears
- "I kind of love you, Sal - I vow." (1804)
- "It sort o' stirs one up to hear about old times." (1833)
Non-categorical phenomena in language (3)
- Language change: "kind of" and "sort of" (cont.)
- Between the 16th and 19th centuries, it grew to look syntactically more like a degree modifier.
- "Their finest and best, is a kind of course(coarse) red cloth." (c. 1600, noun + preposition)
- "But in such questions as the present, a hundred contradictory views may preserve a kind of imperfect analogy." (1743, degree modifier)
- This frequency change seems to have driven a change in syntactic category,
- e.g. modifying verb phrases.
- Details of gradual language change can only be made sense of by examining frequencies of use.
Language and cognition as probabilistic phenomena
- Human cognition is probabilistic.
- Language must therefore be probabilistic too, since it is an integral part of cognition.
- If language and cognition are best explained probabilistically,
- Then probability theory must be a central part of an explanatory theory of language.
The Ambiguity of Language: Why NLP Is Difficult (1)
- e.g. Ambiguity in syntactic analysis (parsing)
- "Our company is training workers." - 3 parses
- Parses (b) and (c) are semantically anomalous.
The Ambiguity of Language: Why NLP Is Difficult (2)
- As sentences get longer and grammars get more comprehensive,
- Ambiguities lead to a terrible multiplication of parses.
- "List the sales of the products produced in 1973 with the products produced in 1972." - 455 parses
- There are many kinds of ambiguity in
- Word sense, word category, syntactic structure, and semantic scope.
- The goal of maximizing coverage while minimizing ambiguity is
- Fundamentally inconsistent with symbolic NLP systems.
The Ambiguity of Language: Why NLP Is Difficult (3)
- Manual rule creation and hand-tuning
- Time-consuming to build.
- Does not scale up well.
- Produces a knowledge acquisition bottleneck.
- Performs poorly.
- The majority of statistical models for disambiguation
- Are robust.
- Behave gracefully
- In the presence of errors and new data.
- Automatic learning
- Reduces the human effort of producing NLP systems.
Dirty Hands
- Lexical resources
- Word counts
- Zipf's laws
- Collocations
- Concordances
Lexical resources
- Corpus
- A collection of machine-readable texts (MRT)
- Types
- Raw corpus vs. tagged corpus
- Balanced corpus vs. unbalanced corpus
- Monolingual corpus vs. multilingual (e.g. bilingual) corpus
- Famous corpora
- English: Brown, LOB (Lancaster-Oslo-Bergen), Susanne, BNC (British National Corpus), and the Penn Treebank
- Korean corpora
- Bilingual: the Canadian Hansards (English and French)
- Other lexical resources
- Dictionaries, thesauri, and also tools
- WordNet
Word counts (1)
- What are the most common words in the text?
- Table 1.1: Common words in Tom Sawyer.
- They have important grammatical roles, and are usually referred to as function words.
Word counts (2)
- Tokens
- Individual occurrences of something.
- Types
- The different things present.
- Word tokens
- Individual occurrences of words.
- e.g. word tokens in Tom Sawyer: 71,370
- Word types
- The different words that appear in the text.
- e.g. word types in Tom Sawyer: 8,108
Word counts (3)
- Ratio of tokens to types
- The average frequency with which each type is used.
- Table 1.2: Frequency of frequencies of word types in Tom Sawyer.
- What makes frequency-based approaches to language hard is that almost all words are rare.
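The token/type counts above can be sketched in a few lines of Python. The tokenizer here (lowercasing plus splitting on letter runs) is a simplifying assumption; the exact figures for Tom Sawyer depend on the tokenization the book uses.

```python
# Minimal sketch: count word tokens and word types, and their ratio.
# The regex-based tokenizer is a simplifying assumption, not the book's.
import re
from collections import Counter

def word_counts(text):
    tokens = re.findall(r"[a-z']+", text.lower())  # naive tokenization
    freq = Counter(tokens)                         # type -> frequency
    return len(tokens), len(freq), freq

text = "the quick brown fox jumps over the lazy dog the fox"
n_tokens, n_types, freq = word_counts(text)
print(n_tokens, n_types)       # 11 tokens, 8 types
print(n_tokens / n_types)      # average frequency per type
```

For Tom Sawyer the same ratio would be 71,370 / 8,108, i.e. each type is used about 8.8 times on average.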
Zipf's Laws (1)
- The famous law: Zipf's law
- The frequency distribution of words
- f · r = k
- f: the frequency
- r: the rank
- k: a constant
- Table 1.3: Empirical evaluation of Zipf's law on Tom Sawyer.
- Main upshot
- For most words our data about their use will be exceedingly sparse.
- Only for a few words will we have lots of examples.
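A minimal sketch of the empirical check behind Table 1.3: rank words by frequency and inspect the product f · r. The toy token list below is invented for illustration; with a real corpus the products only roughly cluster around a constant.

```python
# Sketch of checking Zipf's law f * r ~= k on a (toy) token list.
from collections import Counter

def zipf_table(tokens):
    """Return (word, frequency, rank, frequency * rank) rows, ranked by frequency."""
    freqs = Counter(tokens)
    ranked = sorted(freqs.items(), key=lambda kv: -kv[1])
    return [(word, f, r, f * r) for r, (word, f) in enumerate(ranked, start=1)]

# Invented counts chosen so the top products come out equal.
tokens = ["the"] * 6 + ["of"] * 3 + ["and"] * 2 + ["fox", "dog"]
for word, f, r, product in zipf_table(tokens):
    print(word, f, r, product)
```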
Zipf's Laws (2)
- Mandelbrot's formula
- f = P(r + ρ)^(-B)
- P, ρ, and B: constant parameters
- If B = 1 and ρ = 0, it simplifies to Zipf's law: f = P/r, i.e. f · r = P.
Zipf's Laws (3)
- Figure 1.1: Zipf's law
- k = 100,000
- Figure 1.2: Mandelbrot's formula
- P = 10^5.4, B = 1.15, ρ = 100
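Mandelbrot's formula can be evaluated directly. The parameter values below are the ones quoted for the Figure 1.2 fit; the final assertion checks the special case noted above, that B = 1 and ρ = 0 recover Zipf's law.

```python
# Sketch: Mandelbrot's formula f = P * (r + rho) ** (-B).
def mandelbrot(r, P, B, rho):
    return P * (r + rho) ** (-B)

# Parameter values quoted for the Figure 1.2 fit.
P, B, rho = 10 ** 5.4, 1.15, 100

for r in (1, 10, 100, 1000, 8000):
    print(r, round(mandelbrot(r, P, B, rho)))

# Special case: B = 1, rho = 0 gives f = P / r (Zipf's law).
assert mandelbrot(5, P=100_000, B=1, rho=0) == 100_000 / 5
```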
Zipf's Laws (4)
- Other laws
- Correlation between the number of meanings and frequency
- m ∝ √f
- m: the number of meanings of a word, f: its frequency
- Tendency of content words to clump together
- F ∝ I^(-p)
- F: the frequency of intervals of size I between occurrences of a word
- p: a constant between 1 and 1.3
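The meaning-frequency relation can be illustrated with a one-line computation; since the proportionality constant is arbitrary here, only ratios of predictions are meaningful.

```python
# Sketch of m ∝ sqrt(f): a word 100 times more frequent is predicted
# to have about 10 times as many meanings. The constant c is arbitrary.
import math

def predicted_meanings(f, c=1.0):
    return c * math.sqrt(f)

print(predicted_meanings(100) / predicted_meanings(1))  # 10.0
```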
Collocations (1)
- Any turn of phrase or accepted usage that people repeat.
- Compounds: disk drive
- Phrasal verbs: make up
- Stock phrases (idioms): bacon and eggs
- Normalization and filtering are needed.
- Important in machine translation (MT) and information retrieval (IR).
Collocations (2)
- Table 1.4: Commonest bigram collocations in the NYT.
- Table 1.5: Frequent bigrams after POS filtering.
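A minimal sketch of the bigram counting behind Tables 1.4 and 1.5. A stop-word filter stands in for the part-of-speech filter used in the book, and the stop-word list and example sentence are invented for illustration; the point is that raw counts are dominated by function-word pairs, which filtering removes.

```python
# Sketch: raw bigram counts vs. filtered counts.
# STOP is an invented stand-in for a real POS-pattern filter.
from collections import Counter

STOP = {"the", "of", "a", "an", "in", "to", "and", "is"}

def bigrams(tokens):
    return list(zip(tokens, tokens[1:]))

def collocation_counts(tokens, filtered=False):
    counts = Counter(bigrams(tokens))
    if filtered:  # keep only bigrams of two content words
        counts = Counter({bg: c for bg, c in counts.items()
                          if bg[0] not in STOP and bg[1] not in STOP})
    return counts

tokens = ("the new york stock exchange is in new york "
          "the new york times is in new york").split()
print(collocation_counts(tokens).most_common(3))                 # raw
print(collocation_counts(tokens, filtered=True).most_common(3))  # filtered
```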
Concordances (1)
- Collecting information about patterns of occurrence of words or phrases.
- Can be useful
- Not only for purposes such as dictionaries for learners of foreign languages,
- But also for use in guiding statistical parsers.
Concordances (2)
- KWIC (Key Word In Context) concordancing program
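A KWIC concordancer can be sketched in a few lines: for each occurrence of the keyword, print a window of surrounding words with the keyword aligned down the middle. The window size and column width here are arbitrary choices.

```python
# Sketch of a KWIC (Key Word In Context) concordancer over a token list.
def kwic(tokens, keyword, window=3):
    """One aligned line per occurrence of `keyword`, with `window` words of context."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left:>30} [{tok}] {right}")
    return lines

tokens = "he showed that a word is known by the company a word keeps".split()
for line in kwic(tokens, "word"):
    print(line)
```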