Language Identification from Text Using Cumulative Frequency Addition

Provided by: bashir3
Learn more at: http://csis.pace.edu
1
Language Identification from Text Using
Cumulative Frequency Addition
  • Bashir Ahmed
  • Student/Faculty Research Day
  • Pace University
  • May 07, 2004

2
Problem Statement
  • Existing Text-to-Speech (TTS) systems fail to
    recognize foreign words in written text. As a
    result, they pronounce foreign-language words
    embedded in native text using the native
    lexicon. This causes poor TTS conversion: the
    foreign words sound unpleasant and their
    meaning is garbled. To be useful in a production
    environment, current TTS technology must improve
    considerably, in particular by recognizing
    foreign words and pronouncing them with the
    correct lexicon. This thesis proposes to
    investigate a way to detect language shifts in
    written text, so that TTS modules can switch to
    the proper lexicon.

http://www.naturalvoices.att.com/demos/
3
What Do We Need?
  • To be able to detect language shift in a written
    document, we must be able to detect the major
    language first. So, we need a Language
    Identifier.
  • The Language Identifier must be efficient; this
    is my focus at this early stage of my research.
  • Existing language identifiers, such as Ngram-based
    rank-order statistics and Naïve Bayesian
    classifiers, work well, but each has its pros
    and cons.

4
Approaches to Language Identification
  • Dictionary Based Approach
  • Machine Learning (ML) Approach

5
ML Approach to Language Identification
  • Feature Extraction From Training Samples
      • Ngram Based Approach (e.g. 'The', 'Th', 'e', 'he',
        etc.)
      • Unique Word Endings (e.g. 'cchi' in Italian,
        'vnd' in Dutch)
      • Grammatical Words / Common Short Words ('the' in
        English, 'es' in French)
      • ASCII Feature Vector
  • Classification Using Proven Algorithm
      • Ngram Rank-Order Statistics
      • Bayesian Decision Rules - Naïve Bayesian
        Classifiers
      • Our New Classifier: Cumulative Frequency Addition
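
The Ngram feature extraction step listed above can be sketched as follows. This is a minimal illustration rather than the presenter's implementation: the function name `extract_ngrams`, the upper-casing of the input, and the 1-to-3 character range are all assumptions made for the example.

```python
from collections import Counter

def extract_ngrams(text, min_n=1, max_n=3):
    """Count character Ngrams of length min_n..max_n inside each word.

    Hypothetical helper: words are split on whitespace and upper-cased,
    and Ngrams never span a word boundary.
    """
    counts = Counter()
    for word in text.upper().split():
        for n in range(min_n, max_n + 1):
            # Slide a window of width n across the word.
            for i in range(len(word) - n + 1):
                counts[word[i:i + n]] += 1
    return counts

counts = extract_ngrams("The")
# The word "The" yields T, H, E, TH, HE and THE, each counted once.
```

Running the same function over a large sample of each language would give the per-language Ngram frequencies that the classifiers on the following slides consume.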

6
Ngram Based Rank-Order Statistics
                 Category Profile   Document Profile   Out of Place
Most Frequent    TH                 TH                 0
                 ER                 ING                3
                 ON                 ON                 0
                 LE                 ER                 2
                 ING                AND                1
Least Frequent   AND                ED                 No match = max distance
  • Pros
      • Insensitive to typographical errors
      • Works better than other classifiers when dealing
        with shorter strings
  • Cons
      • Relatively slow due to counting and sorting of
        Ngrams
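
The rank comparison in this slide's example can be sketched in a few lines. The sketch assumes that the first column of the example is the category (training) profile and the second is the document profile, which is consistent with the listed distances. The function name `out_of_place_distance` and the choice of the category profile length as the maximum distance for unmatched Ngrams are also assumptions; the slide does not say what value "max distance" takes.

```python
def out_of_place_distance(category_profile, document_profile, max_distance=None):
    """Sum the rank differences between two frequency-sorted Ngram lists.

    Both profiles are lists ordered from most to least frequent.
    Ngrams missing from the category profile cost max_distance
    (assumed here to default to the category profile length).
    """
    category_rank = {ng: rank for rank, ng in enumerate(category_profile)}
    if max_distance is None:
        max_distance = len(category_profile)
    total = 0
    for doc_rank, ng in enumerate(document_profile):
        if ng in category_rank:
            total += abs(category_rank[ng] - doc_rank)
        else:
            total += max_distance
    return total

# Profiles taken from the slide's example.
category = ["TH", "ER", "ON", "LE", "ING", "AND"]
document = ["TH", "ING", "ON", "ER", "AND", "ED"]
distance = out_of_place_distance(category, document)
# 0 + 3 + 0 + 2 + 1 + 6 (no match for ED) = 12
```

The document is assigned to the language whose category profile gives the smallest total distance. Building each profile requires counting and sorting Ngrams, which is the cost named under Cons.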

7
Bayesian Decision Rules
  • Given the task of deciding which of two possible
    phenomena has caused a particular observation,
    we can minimize our probability of error by
    choosing the one most likely to have given rise
    to the observation. Given observation X, to
    choose between A and B we can use Bayes'
    Theorem:
  • P(A,X) = p(A|X) p(X). Since p(X) = 1, P(A,X) =
    p(A|X) = p(X|A) p(A)
  • Similarly, P(B,X) = p(X|B) p(B)
  • P(A) and P(B) are the prior probabilities.
  • p(X|B) is the likelihood of observing X given B.
  • Calculating the likelihood is not trivial.
    However, the calculation can be simplified by
    assuming that the component probabilities are
    independent of each other, so that they can be
    multiplied to get the likelihood. This is called
    the Naïve Bayesian Assumption and has been used
    successfully in Language Classification.

For language classification, you simply multiply
the probabilities of all matching Ngrams in the
training database and select the language that
produces the highest result.
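
A minimal sketch of this multiplication rule follows. It is an illustration under stated assumptions, not the presenter's code: the `TRAINING` table holds made-up frequencies, equal priors are assumed for all languages, and Ngrams missing from a language receive a small hypothetical `floor` probability. The sketch adds logarithms instead of multiplying raw probabilities, which produces the same ranking while avoiding floating-point underflow on long strings.

```python
import math

# Made-up relative frequencies for illustration only; a real training
# database would be built from large text samples of each language.
TRAINING = {
    "english": {"TH": 0.030, "HE": 0.025, "ER": 0.015},
    "french": {"ES": 0.028, "LE": 0.024, "ER": 0.012},
}

def naive_bayes_language(ngrams, training=TRAINING, floor=1e-6):
    """Return the language with the highest product of Ngram probabilities.

    Equal priors are assumed. Unseen Ngrams get the probability `floor`
    (an assumption) so that the logarithm is always defined.
    """
    scores = {}
    for language, freqs in training.items():
        # Sum of logs ranks languages exactly like the product would.
        scores[language] = sum(math.log(freqs.get(ng, floor)) for ng in ngrams)
    return max(scores, key=scores.get)

best = naive_bayes_language(["TH", "HE", "ER"])
```

With these toy numbers the test Ngrams TH, HE and ER select "english", because two of them are absent from the French table and fall back to the floor value.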
8
Cumulative Frequency Addition
  • Add the frequencies of all matching Ngrams in the
    training database and choose the language that
    yields the highest total.
  • Pros
      • Extremely simple
      • Efficient
      • More accurate than the Naïve Bayesian Classifier
        (NBC) with shorter strings
  • Cons
      • Needs a large training set
      • Performance may not be as good with a smaller
        training set
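
The addition rule can be sketched as below. The `TRAINING` table contains made-up frequencies and the function name `cumulative_frequency_addition` is hypothetical; how the frequencies are normalized in the actual training database is not stated in the slides, so plain relative frequencies are assumed.

```python
# Made-up relative frequencies for illustration only; a real training
# database would be built from large text samples of each language.
TRAINING = {
    "english": {"TH": 0.030, "HE": 0.025, "ER": 0.015},
    "french": {"ES": 0.028, "LE": 0.024, "ER": 0.012},
}

def cumulative_frequency_addition(ngrams, training=TRAINING):
    """Add the frequencies of matching Ngrams and pick the largest total.

    Ngrams that do not occur in a language contribute nothing to
    that language's score.
    """
    scores = {
        language: sum(freqs.get(ng, 0.0) for ng in ngrams)
        for language, freqs in training.items()
    }
    return max(scores, key=scores.get), scores

language, scores = cumulative_frequency_addition(["TH", "HE", "ER"])
# english: 0.030 + 0.025 + 0.015 = 0.070; french: 0.012
```

Unlike the multiplication rule, a missing Ngram simply adds zero here, so no smoothing value is needed and a single unseen Ngram cannot dominate the score.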

Demo
9
Summary
  • Ngram-based language identification can be useful
    even when there are typographical errors in the
    text.
  • Even though the rank-order statistical method is
    accurate with shorter strings, it may be less
    useful when the analysis requires processing
    large strings.
  • Naïve Bayesian Classifiers are efficient, but
    they require a large test string to classify
    correctly.
  • Cumulative frequency addition is simple, accurate
    and efficient, but it may not work well with a
    smaller training set. Further investigation is
    necessary!