Basics of Natural Language Processing - PowerPoint PPT Presentation

Slides: 27
Provided by: edwardj8
1
Lecture 5
  • Basics of Natural Language Processing

2
Aims of Linguistic Science
  • Characterize and explain the linguistics
    observations
  • Conversation
  • Writing
  • Other media
  • How humans acquire, produce and use language
  • Relationship between linguistic utterances and
    the world
  • Understand linguistic structures by which
    language communicates
  • Rules

3
All grammars leak!
  • Grammars attempt to describe well-formed versus
    ill-formed utterances
  • It is not possible to give an exact and complete
    characterization that cleanly divides the two.
  • People are always stretching and bending rules

4
Alternate Approach
  • Abandon the idea of dividing sentences into
    grammatical and non-grammatical ones.
  • Ask: What are the common patterns that occur in
    language use?
  • The approach becomes statistical → Statistical
    Natural Language Processing.

5
Rationalist Approach to NLP
  • Dominant from 1960-1985
  • Prevalent in linguistics, psychology, artificial
    intelligence, natural language processing
  • Characterized by the belief that a significant
    part of the knowledge in the human mind is not
    derived by the senses, but is fixed in advance by
    genetic inheritance.
  • Within linguistics, the rationalist position has
    come to dominate the field through the widespread
    acceptance of Noam Chomsky's arguments for an
    innate language faculty.

6
Poverty of Stimulus
  • How can children learn something as complex as
    natural language from the limited input they hear
    during their early years?
  • The rationalist approach says key parts of
    language are innate, hardwired in the brain at
    birth as part of the human genetic inheritance.

7
Empiricist Approach
  • Dominant from 1920-1960 and re-emerging now.
  • Agree that some cognitive abilities are present
    in the brain.
  • But the thrust of the empiricist approach is that
    the mind does not begin with detailed sets of
    principles and procedures specific to various
    components of language and other cognitive
    domains.

8
Generative Linguistics
  • Chomskyan or generative linguistics seeks to
    describe the language module in the human brain
    (the I-language) for which data such as texts
    (the E-language) provide only indirect evidence.
  • Distinguish between linguistic competence, which
    reflects the knowledge of language structure that
    is in the mind of a native speaker, and
  • Linguistic performance in the world, which is
    affected by real-world factors such as memory
    limitations and noise.

9
Statistical NLP
  • The aim is to assign probabilities to linguistic
    events so that one can say which sentences are
    usual and which are unusual.
  • Interested in good descriptions of associations
    and preferences that occur in the totality of
    language use.

10
Questions Linguistics Should Answer
  • What do people say?
  • What do these things say/ask/request about the
    world?
  • Patterns in corpora more easily reveal the
    syntactic structure of language and so
    statistical NLP deals principally with the first
    question.
  • Generative linguistics abstracts away from any
    attempt to describe the things people actually
    say, and instead seeks to describe a competence
    grammar that is said to underlie the language:
    what is resident in people's minds.

11
Grammaticality
  • The concept of grammaticality is meant to be
    judged on whether a sentence is structurally
    well-formed.
  • Not on whether it is the kind of thing people
    would say.
  • Not on whether it is semantically meaningful
  • Colorless green ideas sleep furiously.

12
Blending of Parts of Speech
  • Near as adjective or preposition
  • We will review that decision in the near future.
  • Adjective
  • He lives near the station.
  • Preposition
  • We nearly lost.
  • Adjective → adverb
  • He lives right near the station.
  • Preposition modified by adjective
  • We live nearer the water than you thought.
  • Preposition in comparative form

13
Language Change
  • Two uses of kind of and sort of.
  • What sort of animal made these tracks?
  • Noun
  • We are kind of hungry.
  • Adjective (degree modifiers) similar to somewhat.
  • He sort of understood what was going on.
  • Adverb (degree modifier).
  • The nette sent in to the see, and alle kind of
    fishis gedrynge. 1382
  • I knowe that sorte of men ryght well. 1560
  • I kind of love you, Sal. 1804
  • It sort o' stirs one up to hear about old times.
    1833

14
Language Change
  • While language change can be sudden, it is
    generally gradual.
  • The details of gradual change can only be made
    sense of by examining frequencies of use and
    being sensitive to varying strengths of
    relationships.
  • This type of modeling requires statistical as
    opposed to categorical observations
  • Human cognition is probabilistic and so language
    must be probabilistic too.
  • This implies probability is key to scientific
    understanding of language.

15
Disambiguation
  • I have given several examples in previous
    lectures of ambiguous sentences.
  • NLP System must be good at making disambiguation
    decisions with respect to word sense, word
    category, syntactic structure, and semantic
    scope.
  • Hand-coded syntactic constraints and preference
    rules are time consuming to build, do not scale
    well, and are brittle in the face of the
    extensive use of metaphor in language.

16
Disambiguation
  • A traditional approach is to use selectional
    restrictions.
  • For example, a verb like swallow requires an
    animate being as its subject and a physical
    object as its object.
  • Counterexamples.
  • I swallowed his story, hook, line, and sinker.
  • The supernova swallowed the planet.
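A toy version of such a selectional-restriction check can be sketched in a few lines; the feature lexicon and the `check` helper below are hypothetical illustrations, not any standard NLP resource:

```python
# Hypothetical toy lexicon of semantic features per noun.
LEXICON = {
    "cat": {"animate", "physical"},
    "pill": {"physical"},
    "story": {"abstract"},
    "supernova": {"physical"},
}

# 'swallow' expects an animate subject and a physical object.
RESTRICTIONS = {"swallow": {"subject": "animate", "object": "physical"}}

def check(verb, subject, obj):
    # Return True if both arguments satisfy the verb's restrictions.
    rules = RESTRICTIONS[verb]
    return (rules["subject"] in LEXICON[subject]
            and rules["object"] in LEXICON[obj])

print(check("swallow", "cat", "pill"))        # True
print(check("swallow", "supernova", "pill"))  # False: subject not animate
```

The counterexamples on the slide are exactly the cases such hand-coded rules reject even though the sentences are perfectly natural, which is why the approach is brittle.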

17
Getting Hands Dirty
  • Lexical resources: machine-readable text,
    dictionaries, thesauri, and the tools for
    processing them.
  • Brown Corpus
  • A tagged corpus of about 1,000,000 words
    assembled at Brown University in the 1960s and
    1970s.
  • Lancaster-Oslo-Bergen Corpus
  • British version.
  • Susanne Corpus
  • Free subset of Brown Corpus
  • Penn Treebank
  • From Wall Street Journal
  • Canadian Hansards
  • Proceedings of the Canadian parliament; a
    bilingual corpus.

18
Word Counts
Word tokens versus word types. The word token
count is the total number of running words. In
Tom Sawyer there are 71,370 word tokens.
19
Word Counts
In contrast, word types refers to the number of
distinct words. In Tom Sawyer there are 8,018
word types. One can calculate the ratio of tokens
to types, which is just the average frequency of
each word type. The ratio is about 8.9.
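The token/type counting can be sketched in a few lines of Python; the sample sentence here is an illustrative stand-in, not text from Tom Sawyer:

```python
# Count word tokens (every occurrence) versus word types (distinct
# words). The sample sentence is illustrative, not from Tom Sawyer.
text = "the cat sat on the mat and the dog sat on the rug"
words = text.split()

num_tokens = len(words)        # 13: every running word counts
num_types = len(set(words))    # 8: distinct words only

# The token/type ratio is the average frequency of each word type.
print(num_tokens, num_types, num_tokens / num_types)
```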
20
Zipf's Law
  • If we count how often each word type occurs in a
    large corpus, then list the words in order of
    frequency of occurrence, we can explore the
    relationship between the frequency of a word, f,
    and its position in the list, known as its rank,
    r.
  • Zipf's law says f ∝ 1/r or, equivalently,
    f · r ≈ constant.
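The rank–frequency relationship can be inspected by sorting word counts; a minimal sketch, with a tiny sample text assumed for illustration (far too small for the law to hold well):

```python
from collections import Counter

# Rank word types by frequency and inspect the product f * r,
# which Zipf's law predicts to be roughly constant. A corpus this
# small only illustrates the bookkeeping; the law needs large corpora.
text = ("the quick brown fox jumps over the lazy dog "
        "the fox and the dog ran over the hill")
counts = Counter(text.split())
ranked = counts.most_common()  # [(word, f)] sorted by f, rank 1 first

for r, (word, f) in enumerate(ranked, start=1):
    print(r, word, f, f * r)
```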

21
Word Counts
22
Zipf's Law
  • The product f · r tends to bulge for words of
    rank around 100.
  • For human languages, Zipf's law is a useful rough
    description of the frequency distribution: there
    are a few very common words, a medium number of
    medium-frequency words, and many low-frequency
    words.

23
Zipf's Law
24
Mandelbrot's Law
  • Mandelbrot derived a better fit.
  • f = P(r + ρ)^(-B)
  • Here P, B, and ρ are parameters of a text that
    collectively measure the richness of the text's
    use of words.
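A sketch comparing the two fits; the parameter values below are illustrative assumptions, not values fitted to any corpus:

```python
# Compare Zipf's f = C / r with Mandelbrot's f = P * (r + rho)**(-B).
# All parameter values are illustrative assumptions, not fitted.
C = 10**5
P, B, rho = 10**5, 1.15, 2.7

def zipf_freq(r):
    return C / r

def mandelbrot_freq(r):
    return P * (r + rho) ** (-B)

# Mandelbrot's extra parameters flatten the curve at low ranks,
# where Zipf's one-parameter fit is usually worst.
for r in (1, 10, 100, 1000):
    print(r, round(zipf_freq(r), 1), round(mandelbrot_freq(r), 1))
```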

25
Mandelbrot's Law
26
Other Laws
  • If m is the number of meanings a word can have,
    then Zipf argues m ∝ f^(1/2).
  • Equivalently, m ∝ r^(-1/2).
  • Power Laws
  • The probability of a word of length n being
    generated at random is (26/27)^n · (1/27).
  • There are 26 times more words of length n+1 than
    words of length n.
  • There is a constant ratio by which words of
    length n are more frequent than words of
    length n+1.
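The length-based power law comes from a random-typing model: 27 equiprobable keys (26 letters plus a space), where a word of length n is n letters followed by a space. A minimal sketch under that assumption:

```python
# Random-typing model: 27 equally likely keys (26 letters + space).
# A "word" of length n is n letters followed by a space.
def p_length(n):
    # probability that a randomly generated word has length n
    return (26 / 27) ** n * (1 / 27)

def num_strings(n):
    # number of distinct letter strings of length n
    return 26 ** n

# 26 times more possible words of length n+1 than of length n,
# and a constant ratio 27/26 between successive length probabilities.
print(num_strings(4) // num_strings(3))  # 26
print(p_length(3) / p_length(4))         # 27/26 ≈ 1.0385
```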