Basics of Natural Language Processing - PowerPoint PPT Presentation

Slides: 27
Provided by: edwardj8
1
Lecture 5
  • Basics of Natural Language Processing

2
Aims of Linguistic Science
  • Characterize and explain the linguistics
    observations
  • Conversation
  • Writing
  • Other media
  • How humans acquire, produce and use language
  • Relationship between linguistic utterances and
    the world
  • Understand linguistic structures by which
    language communicates
  • Rules

3
All grammars leak!
  • Grammars attempt to describe well-formed versus
    ill-formed utterances
  • It is not possible to give an exact and complete
    characterization that cleanly divides the two.
  • People are always stretching and bending rules

4
Alternate Approach
  • Abandon the idea of dividing sentences into
    grammatical and non-grammatical ones.
  • Ask: What are the common patterns that occur in
    language use?
  • The approach becomes statistical → Statistical
    Natural Language Processing.

5
Rationalist Approach to NLP
  • Dominant from 1960-1985
  • Prevalent in linguistics, psychology, artificial
    intelligence, natural language processing
  • Characterized by the belief that a significant
    part of the knowledge in the human mind is not
    derived by the senses, but is fixed in advance by
    genetic inheritance.
  • Within linguistics, the rationalist position has
    come to dominate the field through the widespread
    acceptance of Noam Chomsky's arguments for an
    innate language faculty.

6
Poverty of Stimulus
  • How can children learn something as complex as
    natural language from the limited input they hear
    during their early years?
  • The rationalist approach says key parts of
    language are innate, hardwired in the brain at
    birth as part of the human genetic inheritance.

7
Empiricist Approach
  • Dominant from 1920-1960 and re-emerging now.
  • Agree that some cognitive abilities are present
    in the brain.
  • But the thrust of the empiricist approach is that
    the mind does not begin with detailed sets of
    principles and procedures specific to various
    components of language and other cognitive
    domains.

8
Generative Linguistics
  • Chomskyan or generative linguistics seeks to
    describe the language module in the human brain
    (the I-language) for which data such as texts
    (the E-language) provide only indirect evidence.
  • Distinguish between linguistic competence, which
    reflects the knowledge of language structure that
    is in the mind of a native speaker, and
  • Linguistic performance in the world, which is
    affected by real-world factors such as memory
    limitations and noise.

9
Statistical NLP
  • The aim is to assign probabilities to linguistic
    events so that one can say which sentences are
    usual and which are unusual.
  • Interested in good descriptions of associations
    and preferences that occur in the totality of
    language use.

10
Questions Linguistics Should Answer
  • What do people say?
  • What do these things say/ask/request about the
    world?
  • Patterns in corpora more easily reveal the
    syntactic structure of language and so
    statistical NLP deals principally with the first
    question.
  • Generative linguistics abstracts away from any
    attempt to describe the things people actually
    say, and instead seeks to describe a competence
    grammar that is said to underlie the language:
    what is resident in people's minds.

11
Grammaticality
  • The concept of grammaticality is meant to be
    judged on whether a sentence is structurally
    well-formed.
  • Not on whether it is the kind of thing people
    would say.
  • Not on whether it is semantically meaningful
  • Colorless green ideas sleep furiously.

12
Blending of Parts of Speech
  • Near as adjective or preposition
  • We will review that decision in the near future.
  • Adjective
  • He lives near the station.
  • Preposition
  • We nearly lost.
  • Adjective → adverb
  • He lives right near the station.
  • Preposition modified by adjective
  • We live nearer the water than you thought.
  • Preposition in comparative form

13
Language Change
  • Two uses of kind of and sort of.
  • What sort of animal made these tracks?
  • Noun
  • We are kind of hungry.
  • Adjective (degree modifiers) similar to somewhat.
  • He sort of understood what was going on.
  • Adverb (degree modifier).
  • The nette sent in to the see, and alle kind of
    fishis gedrynge. 1382
  • I knowe that sorte of men ryght well. 1560
  • I kind of love you, Sal. 1804
  • It sort o' stirs one up to hear about old times.
    1833

14
Language Change
  • While language change can be sudden, it is
    generally gradual.
  • The details of gradual change can only be made
    sense of by examining frequencies of use and
    being sensitive to varying strengths of
    relationships.
  • This type of modeling requires statistical as
    opposed to categorical observations
  • Human cognition is probabilistic and so language
    must be probabilistic too.
  • This implies probability is key to scientific
    understanding of language.

15
Disambiguation
  • I have given several examples in previous
    lectures of ambiguous sentences.
  • NLP System must be good at making disambiguation
    decisions with respect to word sense, word
    category, syntactic structure, and semantic
    scope.
  • Hand-coded syntactic constraints and preference
    rules are time consuming to build, do not scale
    well, and are brittle in the face of the
    extensive use of metaphor in language.

16
Disambiguation
  • A traditional approach is to use selectional
    restrictions.
  • For example, a verb like swallow requires an
    animate being as its subject and a physical
    object as its object.
  • Counterexamples.
  • I swallowed his story, hook, line, and sinker.
  • The supernova swallowed the planet.
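A toy version of such a selectional-restriction check can be sketched in a few lines; the feature lexicon and the `check` helper below are hypothetical illustrations, not any standard NLP resource:

```python
# Hypothetical toy lexicon of semantic features per noun.
LEXICON = {
    "cat": {"animate", "physical"},
    "pill": {"physical"},
    "story": {"abstract"},
    "supernova": {"physical"},
}

# 'swallow' expects an animate subject and a physical object.
RESTRICTIONS = {"swallow": {"subject": "animate", "object": "physical"}}

def check(verb, subject, obj):
    # Return True if both arguments satisfy the verb's restrictions.
    rules = RESTRICTIONS[verb]
    return (rules["subject"] in LEXICON[subject]
            and rules["object"] in LEXICON[obj])

print(check("swallow", "cat", "pill"))        # True
print(check("swallow", "supernova", "pill"))  # False: subject not animate
```

The counterexamples on the slide are exactly the cases such hand-coded rules reject even though the sentences are perfectly natural, which is why the approach is brittle.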

17
Getting Hands Dirty
  • Lexical resources: machine-readable text,
    dictionaries, thesauri, and the tools for
    processing them.
  • Brown Corpus
  • A tagged corpus of about 1,000,000 words
    assembled at Brown University in the 1960s and
    1970s.
  • Lancaster-Oslo-Bergen Corpus
  • British version.
  • Susanne Corpus
  • Free subset of Brown Corpus
  • Penn Treebank
  • From Wall Street Journal
  • Canadian Hansards
  • Proceedings of the Canadian parliament; a
    bilingual corpus.

18
Word Counts
Word tokens versus word types. The word token
count is the total number of running words. In
Tom Sawyer there are 71,370 word tokens.
19
Word Counts
In contrast, word types refers to the number of
distinct words. In Tom Sawyer there are 8,018
word types. One can calculate the ratio of tokens
to types, which is just the average frequency of
each word type. The ratio is about 8.9.
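The token/type counting can be sketched in a few lines of Python; the sample sentence here is an illustrative stand-in, not text from Tom Sawyer:

```python
# Count word tokens (every occurrence) versus word types (distinct
# words). The sample sentence is illustrative, not from Tom Sawyer.
text = "the cat sat on the mat and the dog sat on the rug"
words = text.split()

num_tokens = len(words)        # 13: every running word counts
num_types = len(set(words))    # 8: distinct words only

# The token/type ratio is the average frequency of each word type.
print(num_tokens, num_types, num_tokens / num_types)
```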
20
Zipf's Law
  • If we count how often each word type occurs in a
    large corpus, then list the words in order of
    frequency of occurrence, we can explore the
    relationship between the frequency of a word, f,
    and its position in the list, known as its rank,
    r.
  • Zipf's law says f ∝ 1/r or, equivalently,
    f · r ≈ constant.
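The rank–frequency relationship can be inspected by sorting word counts; a minimal sketch, with a tiny sample text assumed for illustration (far too small for the law to hold well):

```python
from collections import Counter

# Rank word types by frequency and inspect the product f * r,
# which Zipf's law predicts to be roughly constant. A corpus this
# small only illustrates the bookkeeping; the law needs large corpora.
text = ("the quick brown fox jumps over the lazy dog "
        "the fox and the dog ran over the hill")
counts = Counter(text.split())
ranked = counts.most_common()  # [(word, f)] sorted by f, rank 1 first

for r, (word, f) in enumerate(ranked, start=1):
    print(r, word, f, f * r)
```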

21
Word Counts
22
Zipf's Law
  • The product f · r tends to bulge for words of
    rank around 100.
  • For human languages, Zipf's law is a useful rough
    description of the frequency distribution: there
    are a few very common words, a medium number of
    medium-frequency words, and many low-frequency
    words.

23
Zipf's Law
24
Mandelbrot's Law
  • Mandelbrot derived a better fit.
  • f = P(r + ρ)^(-B)
  • Here P, B, and ρ are parameters of a text that
    collectively measure the richness of the text's
    use of words.
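A sketch comparing the two fits; the parameter values below are illustrative assumptions, not values fitted to any corpus:

```python
# Compare Zipf's f = C / r with Mandelbrot's f = P * (r + rho)**(-B).
# All parameter values are illustrative assumptions, not fitted.
C = 10**5
P, B, rho = 10**5, 1.15, 2.7

def zipf_freq(r):
    return C / r

def mandelbrot_freq(r):
    return P * (r + rho) ** (-B)

# Mandelbrot's extra parameters flatten the curve at low ranks,
# where Zipf's one-parameter fit is usually worst.
for r in (1, 10, 100, 1000):
    print(r, round(zipf_freq(r), 1), round(mandelbrot_freq(r), 1))
```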

25
Mandelbrot's Law
26
Other Laws
  • If m is the number of meanings a word can have,
    then Zipf argues m ∝ f^(1/2).
  • Equivalently, m ∝ r^(-1/2).
  • Power Laws
  • The probability of a word of length n being
    generated at random is (26/27)^n · (1/27).
  • There are 26 times more words of length n+1 than
    words of length n.
  • There is a constant ratio by which words of
    length n are more frequent than words of
    length n+1.
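The length-based power law comes from a random-typing model: 27 equiprobable keys (26 letters plus a space), where a word of length n is n letters followed by a space. A minimal sketch under that assumption:

```python
# Random-typing model: 27 equally likely keys (26 letters + space).
# A "word" of length n is n letters followed by a space.
def p_length(n):
    # probability that a randomly generated word has length n
    return (26 / 27) ** n * (1 / 27)

def num_strings(n):
    # number of distinct letter strings of length n
    return 26 ** n

# 26 times more possible words of length n+1 than of length n,
# and a constant ratio 27/26 between successive length probabilities.
print(num_strings(4) // num_strings(3))  # 26
print(p_length(3) / p_length(4))         # 27/26 ≈ 1.0385
```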