LING 572

Transcript and Presenter's Notes
1
Introduction
  • LING 572
  • Fei Xia
  • Week 1: 1/3/06

2
Outline
  • Course overview
  • Problems and methods
  • Mathematical foundation
  • Probability theory
  • Information theory

3
Course overview
4
Course objective
  • Focus on statistical methods that produce
    state-of-the-art results
  • Questions for each algorithm:
  • How does the algorithm work (input, output,
    steps)?
  • What kinds of tasks can the algorithm be applied
    to?
  • How much data is needed?
  • Labeled data
  • Unlabeled data

5
General info
  • Course website:
  • Syllabus (incl. slides and papers) updated every
    week.
  • Message board
  • ESubmit
  • Office hour: W 3-5pm.
  • Prerequisites:
  • Ling570 and Ling571.
  • Programming: C, C++, or Java; Perl is a plus.
  • Introduction to probability and statistics

6
Expectations
  • Reading:
  • Papers are online: who doesn't have access to
    printers?
  • Reference book: Manning & Schütze (M&S)
  • Finish reading before class. Bring your questions
    to class.
  • Grade:
  • Homework (3): 30%
  • Project (6 parts): 60%
  • Class participation: 10%
  • No quizzes or exams

7
Assignments
  • Hw1: FSA and HMM
  • Hw2: DT, DL, and TBL
  • Hw3: Boosting
  • No coding
  • Bring the finished assignments to class.

8
Project
  • P1: Method 1 (Baseline): Trigram
  • P2: Method 2: TBL
  • P3: Method 3: MaxEnt
  • P4: Method 4: choose one of four tasks.
  • P5: Presentation
  • P6: Final report
  • Methods 1-3 are supervised methods.
  • Method 4: bagging, boosting, semi-supervised
    learning, or system combination.
  • P1 is an individual task; P2-P6 are group tasks.
  • A group should have no more than three people.
  • Use ESubmit
  • Need to use others' code and write your own code.

9
Summary of Ling570
  • Overview: corpora, evaluation
  • Tokenization
  • Morphological analysis
  • POS tagging
  • Shallow parsing
  • N-grams and smoothing
  • WSD
  • NE tagging
  • HMM

10
Summary of Ling571
  • Parsing
  • Semantics
  • Discourse
  • Dialogue
  • Natural language generation (NLG)
  • Machine translation (MT)

11
570/571 vs. 572
  • 572 focuses more on statistical approaches.
  • 570/571 are organized by task; 572 is organized
    by learning method.
  • I assume that you know:
  • The basics of each task: POS tagging, parsing, ...
  • The basic concepts: PCFG, entropy, ...
  • Some learning methods: HMM, FSA, ...

12
An example
  • 570/571:
  • POS tagging: HMM
  • Parsing: PCFG
  • MT: Model 1-4 training
  • 572:
  • HMM: forward-backward algorithm
  • PCFG: inside-outside algorithm
  • MT: EM algorithm
  • ⇒ All are special cases of the EM algorithm, one
    method of unsupervised learning.

13
Course layout
  • Supervised methods
  • Decision tree
  • Decision list
  • Transformation-based learning (TBL)
  • Bagging
  • Boosting
  • Maximum Entropy (MaxEnt)

14
Course layout (cont)
  • Semi-supervised methods
  • Self-training
  • Co-training
  • Unsupervised methods
  • EM algorithm
  • Forward-backward algorithm
  • Inside-outside algorithm
  • EM for PM models

15
Outline
  • Course overview
  • Problems and methods
  • Mathematical foundation
  • Probability theory
  • Information theory

16
Problems and methods
17
Types of ML problems
  • Classification problem
  • Estimation problem
  • Clustering
  • Discovery
  • A learning method can be applied to one or more
    types of ML problems.
  • We will focus on the classification problem.

18
Classification problem
  • Given a set of classes and data x, decide which
    class x belongs to.
  • Labeled data:
  • {(xi, yi)} is a set of labeled data.
  • xi is a list of attribute values.
  • yi is a member of a pre-defined set of classes.

19
Examples of classification problem
  • Disambiguation
  • Document classification
  • POS tagging
  • WSD
  • PP attachment: given a set of other phrases
  • Segmentation
  • Tokenization / Word segmentation
  • NP Chunking

20
Learning methods
  • Modeling: represent the problem as a formula and
    decompose the formula into a function of
    parameters
  • Training stage: estimate the parameters
  • Test (decoding) stage: find the answer given the
    parameters

21
Modeling
  • Joint vs. conditional models:
  • P(data, model)
  • P(model | data)
  • P(data | model)
  • Decomposition:
  • Which variable conditions on which variable?
  • What independence assumptions?

22
An example of different modeling
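The body of this slide appears to have been a figure that did not survive the transcript. As an assumed illustration (not the slide's actual content): for POS tagging, a joint model decomposes

    P(w_1, ..., w_n, t_1, ..., t_n) = \prod_i P(t_i \mid t_{i-1}) \, P(w_i \mid t_i)

as an HMM tagger does, while a conditional model estimates P(t_1, ..., t_n \mid w_1, ..., w_n) directly, as a MaxEnt tagger does.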
23
Training
  • Objective functions:
  • Maximize likelihood
  • Minimize error rate
  • Maximum entropy
  • ...
  • Supervised, semi-supervised, unsupervised
  • Ex: maximize likelihood
  • Supervised: simple counting
  • Unsupervised: EM

24
Decoding
  • DP algorithms:
  • CYK for PCFG
  • Viterbi for HMM
  • Pruning (see the Python sketch after this list):
  • TopN: keep the top N hypotheses at each node.
  • Beam: keep hypotheses whose weights > beam ×
    max_weight.
  • Threshold: keep hypotheses whose weights >
    threshold.
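A minimal sketch of the three pruning strategies in Python, under assumed conventions (hypotheses live in a dict mapping each hypothesis to a probability-style weight, higher is better; the function names are illustrative, not course code):

    # Each pruner takes {hypothesis: weight} and returns the surviving subset.

    def prune_topn(hyps, n):
        """TopN: keep the top n hypotheses by weight at a node."""
        return dict(sorted(hyps.items(), key=lambda kv: kv[1], reverse=True)[:n])

    def prune_beam(hyps, beam):
        """Beam: keep hypotheses whose weight > beam * max_weight (0 < beam < 1)."""
        cutoff = beam * max(hyps.values())
        return {h: w for h, w in hyps.items() if w > cutoff}

    def prune_threshold(hyps, threshold):
        """Threshold: keep hypotheses whose weight > a fixed threshold."""
        return {h: w for h, w in hyps.items() if w > threshold}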

25
Outline
  • Course overview
  • Problems and methods
  • Mathematical foundation
  • Probability theory
  • Information theory

26
Probability Theory
27
Probability theory
  • Sample space, event, event space
  • Random variable and random vector
  • Conditional probability, joint probability,
    marginal probability (prior)

28
Sample space, event, event space
  • Sample space (Ω): a collection of basic outcomes.
  • Ex: toss a coin twice: {HH, HT, TH, TT}
  • Event: a subset of Ω.
  • Ex: {HT, TH}
  • Event space (2^Ω): the set of all possible
    events.

29
Random variable
  • The outcome of an experiment need not be a
    number.
  • We often want to represent outcomes as numbers.
  • A random variable is a function that associates a
    unique numerical value with every outcome of an
    experiment.
  • A random variable is a function X: Ω → R.
  • Ex: toss a coin once: X(H) = 1, X(T) = 0

30
Two types of random variable
  • Discrete random variable: X takes on only a
    countable number of distinct values.
  • Ex: toss a coin 10 times; X is the number of
    tails observed.
  • Continuous random variable: X takes on an
    uncountable number of possible values.
  • Ex: X is the lifetime (in hours) of a light bulb.

31
Probability function
  • The probability function of a discrete random
    variable X gives the probability p(xi) that the
    random variable equals xi, i.e., p(xi) = P(X = xi).
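In symbols, with the constraints implied by the definition (not spelled out on the slide):

    0 \le p(x_i) \le 1, \qquad \sum_i p(x_i) = 1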

32
Random vector
  • A random vector is a finite-dimensional vector of
    random variables: X = (X1, ..., Xn).
  • P(x) = P(x1, x2, ..., xn) = P(X1 = x1, ..., Xn = xn)
  • Ex: P(w1, ..., wn, t1, ..., tn)

33
Three types of probability
  • Joint prob P(x, y): prob of x and y happening
    together
  • Conditional prob P(x | y): prob of x given a
    specific value of y
  • Marginal prob P(x): prob of x, summed over all
    possible values of y

34
Common equations
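The formulas on this slide are an image that is not in the transcript; presumably they are the standard identities (M&S Ch 2):

    P(x, y) = P(x \mid y) \, P(y) = P(y \mid x) \, P(x)    (chain rule)
    P(x) = \sum_y P(x, y)                                  (marginalization)
    P(y \mid x) = P(x \mid y) \, P(y) / P(x)               (Bayes' rule)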
35
More general cases
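This slide's formulas are likewise missing from the transcript; presumably they are the n-variable generalizations of the identities above:

    P(x_1, ..., x_n) = \prod_{i=1}^{n} P(x_i \mid x_1, ..., x_{i-1})
    P(x_1) = \sum_{x_2} \cdots \sum_{x_n} P(x_1, ..., x_n)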
36
Information Theory
37
Information theory
  • Information theory uses probability theory to
    quantify and measure information.
  • Basic concepts:
  • Entropy
  • Joint entropy and conditional entropy
  • Cross entropy and relative entropy
  • Mutual information and perplexity

38
Entropy
  • Entropy is a measure of the uncertainty
    associated with a distribution.
  • It gives the lower bound on the number of bits it
    takes to transmit messages.
  • An example:
  • Display the results of horse races.
  • Goal: minimize the number of bits to encode the
    results.
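The defining formula (an image on the original slide; this is the standard definition for a discrete distribution):

    H(X) = - \sum_x p(x) \log_2 p(x)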

39
An example
  • Uniform distribution: pi = 1/8.
  • Non-uniform distribution: (1/2, 1/4, 1/8, 1/16,
    1/64, 1/64, 1/64, 1/64)

(0, 10, 110, 1110, 111100, 111101, 111110, 111111)
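Working the example through (the arithmetic itself is not on the slide): the uniform distribution has H = -8 · (1/8) · log2(1/8) = 3 bits, so fixed 3-bit codes are optimal. The non-uniform distribution has H = 1·1/2 + 2·1/4 + 3·1/8 + 4·1/16 + 4 · (6·1/64) = 2 bits, which matches the expected length of the variable-length code shown (code lengths 1, 2, 3, 4, 6, 6, 6, 6).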
40
Entropy of a language
  • The entropy of a language L:
  • If we assume that the language is "nice" (e.g.,
    stationary and ergodic), then the entropy can be
    calculated as shown below.
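A sketch of the missing formulas, following M&S Ch 2 conventions (x_{1n} ranges over sequences of length n from L):

    H(L) = - \lim_{n \to \infty} \frac{1}{n} \sum_{x_{1n}} p(x_{1n}) \log p(x_{1n})

and, for a nice (stationary, ergodic) language, a single long sample suffices:

    H(L) = - \lim_{n \to \infty} \frac{1}{n} \log p(x_{1n})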

41
Joint and conditional entropy
  • Joint entropy
  • Conditional entropy
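The standard definitions (the slide's formula images are not in the transcript):

    H(X, Y) = - \sum_x \sum_y p(x, y) \log p(x, y)
    H(Y \mid X) = - \sum_x \sum_y p(x, y) \log p(y \mid x)

and the two are related by the chain rule H(X, Y) = H(X) + H(Y | X).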

42
Cross Entropy
  • Entropy:
  • Cross entropy:
  • Cross entropy is a distance measure between p(x)
    and q(x): p(x) is the true distribution; q(x) is
    our estimate of p(x).
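The two formulas the bullets refer to, in standard form:

    H(p) = - \sum_x p(x) \log p(x)
    H(p, q) = - \sum_x p(x) \log q(x)

Note that H(p, q) \ge H(p), with equality iff q = p.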

43
Cross entropy of a language
  • The cross entropy of a language L:
  • If we assume that the language is "nice" (e.g.,
    stationary and ergodic), then the cross entropy
    can be calculated as shown below.
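A sketch of the missing formulas, parallel to the entropy-of-a-language slide:

    H(L, q) = - \lim_{n \to \infty} \frac{1}{n} \sum_{x_{1n}} p(x_{1n}) \log q(x_{1n})

which for a nice language can be estimated from a single long sample:

    H(L, q) \approx - \frac{1}{n} \log q(x_{1n})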

44
Relative Entropy
  • Also called Kullback-Leibler (KL) distance.
  • Another distance measure between prob functions p
    and q.
  • KL distance is asymmetric, so it is not a true
    distance metric.
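The standard definition and its relation to cross entropy:

    D(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)}
    H(p, q) = H(p) + D(p \| q)

Asymmetry means D(p \| q) \ne D(q \| p) in general.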

45
Relative entropy is non-negative
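The proof on this slide is an image; the standard argument uses Jensen's inequality (log is concave):

    -D(p \| q) = \sum_x p(x) \log \frac{q(x)}{p(x)}
               \le \log \sum_x p(x) \frac{q(x)}{p(x)} = \log \sum_x q(x) = \log 1 = 0

so D(p \| q) \ge 0, with equality iff p = q.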
46
Mutual information
  • It measures how much information X and Y have in
    common.
  • I(X; Y) = KL(p(x,y) || p(x)p(y))
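Expanding the definition (standard identities; the slide shows them only as images, if at all):

    I(X; Y) = \sum_x \sum_y p(x, y) \log \frac{p(x, y)}{p(x) \, p(y)}
            = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)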

47
Perplexity
  • Perplexity is 2^H.
  • Perplexity is the weighted average number of
    choices a random variable has to make.
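A quick worked example (not on the slide): a uniform distribution over 8 outcomes has H = 3 bits, so its perplexity is 2^3 = 8; the model faces, on average, an 8-way choice.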

48
Summary
  • Course overview
  • Problems and methods
  • Mathematical foundation
  • Probability theory
  • Information theory
  • ⇒ M&S Ch 2

49
Next time
  • FSA
  • HMM (M&S Ch 9.1 and 9.2)