Ch1. Introduction - PowerPoint PPT Presentation

About This Presentation
Title:

Ch1. Introduction

Description:

Foundations of Statistical Natural Language Processing. Natural Language ... Inducing the values of parameters by statistical method, pattern recognition, ... – PowerPoint PPT presentation

Slides: 12
Provided by: bolo8

Transcript and Presenter's Notes



1
Ch1. Introduction
Foundations of Statistical Natural Language
Processing
  • 2001.01.03

2
Contents
  • Rationalist and Empiricist Approaches to Language
  • Motivation of Statistical NLP
  • Dirty Hands
  • Lexical resources
  • Word counts
  • Zipf's laws
  • Collocations and Concordances

3
Rationalist and Empiricist Approaches to Language
(1/2)
  • Rationalist Approach
  • 1960-1985, Chomsky
  • The key parts of language are innate
  • Grammar rules already exist from the
    beginning.
  • Rule-based approach
  • Categorical principles
  • Saying that sentences either do or do not satisfy
    the rules

4
Rationalist and Empiricist Approaches to Language
(2/2)
  • Empiricist Approach
  • Currently widespread
  • organizing and generalizing the linguistic
    knowledge from the sensory input
  • Corpus-based approach
  • Assigning probabilities to linguistic events
  • Saying which sentences are usual and unusual
  • An empiricist approach to NLP
  • Specifying an appropriate general language model
  • Inducing the values of parameters by statistical
    methods, pattern recognition, and machine learning
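
The two steps above (specify a general language model, then induce its parameters from data) can be sketched with a bigram model whose parameters are estimated by relative frequency. This is only an illustrative sketch: the bigram form, the toy corpus, and the function names train_bigram_model and sentence_probability are assumptions made here, not something taken from the slides.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Estimate P(w2 | w1) by relative frequency (maximum likelihood)."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        # <s> and </s> are hypothetical sentence-boundary markers.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        context_counts.update(tokens[:-1])           # every token that has a successor
        bigram_counts.update(zip(tokens, tokens[1:]))
    return {
        (w1, w2): count / context_counts[w1]
        for (w1, w2), count in bigram_counts.items()
    }


def sentence_probability(model, sentence):
    """Multiply bigram probabilities; unseen bigrams get probability 0."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    prob = 1.0
    for bigram in zip(tokens, tokens[1:]):
        prob *= model.get(bigram, 0.0)
    return prob


# Toy corpus, invented for illustration only.
toy_corpus = ["the dog barks", "the cat sleeps", "the dog sleeps"]
model = train_bigram_model(toy_corpus)

usual = sentence_probability(model, "the dog sleeps")
unusual = sentence_probability(model, "the cat barks")
print(usual, unusual)
```

The model ranks sentences as more or less usual instead of accepting or rejecting them outright. The zero assigned to the unseen pair ("cat", "barks") also hints at the sparse-data problem that unsmoothed estimates run into.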

5
Motivation of Statistical NLP (1/3)
  • Conventionality
  • Judging whether an utterance is the kind of thing
    people would say, or whether it is semantically
    anomalous
  • "Colorless green ideas sleep furiously"
  • Non-native speakers often say something
    ungrammatical, but we still understand it
  • Explanation of non-categorical phenomena
  • Blending of parts of speech: "near"
  • Language change: "kind of" and "sort of"
  • Human cognition is probabilistic
  • Language must therefore be probabilistic too
    since it is an integral part of cognition

6
Motivation of Statistical NLP (2/3)
  • Disambiguation
  • Difficulty of disambiguation in symbolic NLP
    system
  • There are many ambiguities in word sense, category,
    syntactic structure, and semantic scope
  • As sentences get longer and grammars get more
    comprehensive, ambiguities lead to a terrible
    multiplication of parses
  • "Our company is training workers": 3 parses
  • "List the sales of the products produced in 1973
    with the products produced in 1972": 455 parses
  • The goal of maximizing coverage while minimizing
    resultant ambiguity is fundamentally inconsistent
  • Manual rule creation and hand-tuning
  • Time consuming to build
  • Do not scale up well
  • Produce a knowledge acquisition bottleneck
  • Perform poorly

7
Motivation of Statistical NLP (3/3)
  • Disambiguation (cont.)
  • Advantages of statistical models in disambiguation
  • Robust
  • Behave gracefully in the presence of errors and
    new data
  • Automatic learning
  • Reduce the human effort in producing NLP systems

8
Lexical Resources
Dirty Hands
  • Corpus
  • Collection of machine-readable texts
  • Types
  • Raw corpus vs. tagged corpus
  • Balanced corpus vs. unbalanced corpus
  • Monolingual corpus vs. multilingual corpus
  • Other resources: dictionary, thesaurus
  • Institutions
  • LDC (Linguistic Data Consortium)
  • http://www.ldc.upenn.edu
  • ELRA (European Language Resources Association)
  • http://www.icp.grenet.fr/ELRA

9
Word Counts
Dirty Hands
  • What are the most common words in the text?
  • Table 1.1
  • They have important grammatical roles and are
    usually referred to as function words
  • Word tokens and word types
  • Word tokens
  • Individual occurrences of words (the length of
    text)
  • Word types
  • The distinct words that appear in the text
  • Average frequency
  • The ratio of tokens to types
  • Table 1.2
  • What makes frequency-based approaches to language
    hard is that almost all words are rare
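
The token/type distinction and the average frequency above can be made concrete with a short counting sketch. The regular-expression tokenizer, the sample sentence, and the function name word_statistics are assumptions chosen for illustration. They do not reproduce Tables 1.1 and 1.2.

```python
import re
from collections import Counter


def word_statistics(text):
    """Count word tokens and word types and derive the average frequency."""
    # Simplistic tokenizer (an assumption): lowercase runs of letters/apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    n_tokens = len(tokens)   # individual occurrences
    n_types = len(counts)    # distinct words
    return {
        "tokens": n_tokens,
        "types": n_types,
        # Average frequency = tokens / types (0.0 for empty input).
        "average_frequency": n_tokens / n_types if n_types else 0.0,
        "most_common": counts.most_common(3),
        # Words seen exactly once: the "rare" words the slide refers to.
        "seen_once": sum(1 for c in counts.values() if c == 1),
    }


sample = "the cat sat on the mat and the dog sat on the log"
stats = word_statistics(sample)
print(stats)
```

Even in this tiny sample the function word "the" dominates the counts, while more than half of the types occur only once.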

10
Zipf's Laws
Dirty Hands
  • Zipf's law
  • The frequency distribution of words
  • $f \propto 1/r$, i.e. $f \cdot r = k$
  • $f$: the frequency of a word
  • $r$: the rank of a word
  • $k$: a constant
  • Table 1.3
  • For most words our data about their use will be
    exceedingly sparse
  • Mandelbrot's formula
  • $f = P (r + \rho)^{-B}$
  • $f$: the frequency of a word
  • $r$: the rank of a word
  • $P$, $\rho$, $B$: constants
  • Other laws
  • $m \propto f^{1/2}$
  • $m$: the number of meanings of a word
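
One way to check Zipf's law on a text is to rank the words by frequency and see whether the product of frequency and rank stays roughly constant. The sketch below does this on synthetic data built to follow the law exactly, so it only demonstrates the bookkeeping. The function names and the synthetic vocabulary are assumptions made here. Real corpora give only an approximately constant product.

```python
from collections import Counter


def rank_frequency_table(tokens):
    """Return rows (rank, word, frequency, frequency * rank), most frequent first."""
    ranked = Counter(tokens).most_common()
    return [
        (rank, word, freq, freq * rank)
        for rank, (word, freq) in enumerate(ranked, start=1)
    ]


def mandelbrot_frequency(rank, P, rho, B):
    """Mandelbrot's formula f = P * (r + rho) ** (-B)."""
    return P * (rank + rho) ** (-B)


# Synthetic tokens: the word of rank r occurs 60 // r times (assumption for the demo).
synthetic = []
for rank, word in enumerate(["w1", "w2", "w3", "w4", "w5"], start=1):
    synthetic.extend([word] * (60 // rank))

table = rank_frequency_table(synthetic)
for row in table:
    print(row)
```

With B = 1 and rho = 0, Mandelbrot's formula reduces to Zipf's law with k = P. For example, mandelbrot_frequency(3, 60, 0, 1) gives the frequency of the rank-3 word in the synthetic data.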

11
Collocations and Concordances
Dirty Hands
  • Collocation
  • Any turn of phrase or accepted usage that people
    repeat
  • Include compounds, phrasal verbs, and idioms
  • Candidate lists need normalizing and filtering
  • Important in machine translation and information
    retrieval
  • Concordance
  • Collecting information about patterns of
    occurrence of words or phrases
  • Useful for purposes such as compiling dictionaries
    for learners of foreign languages, and also for
    guiding statistical parsers
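
Both ideas on this slide can be sketched in a few lines: counting adjacent word pairs as collocation candidates, and listing each occurrence of a keyword with its context as a simple concordance. The stopword filter is a simplified stand-in chosen here for the filtering step, and the stopword list, example text, and function names are assumptions for illustration.

```python
from collections import Counter


def frequent_bigrams(tokens, stopwords, top_n=3):
    """Count adjacent word pairs, skipping any pair that contains a stopword."""
    pairs = zip(tokens, tokens[1:])
    counts = Counter(
        (a, b) for a, b in pairs if a not in stopwords and b not in stopwords
    )
    return counts.most_common(top_n)


def concordance(tokens, keyword, width=2):
    """Return (left context, keyword, right context) for each occurrence."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


# Invented example text and stopword list.
text = "she moved to new york and he moved to new york last year"
tokens = text.split()
stopwords = {"to", "and", "the", "of", "a", "in"}

top_pairs = frequent_bigrams(tokens, stopwords, top_n=1)
kwic = concordance(tokens, "new", width=2)
print(top_pairs)
for left, key, right in kwic:
    print(f"{left:>12} [{key}] {right}")
```

Without the filter, pairs such as ("moved", "to") would tie with ("new", "york") at the top of the list, which is why raw bigram counts need filtering before they are useful as collocation candidates.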