Lexical Resource - PowerPoint PPT Presentation

Slides: 17
Provided by: Spe553
Transcript and Presenter's Notes

Title: Lexical Resource


1
Lexical Resource
IIT Kharagpur
2
Resources - MT
  • Title: Machine Translation among Indian
    Languages, IL to English
  • Proposers: Anupam Basu, Sudeshna Sarkar, Pabitra
    Mitra
  • Institution: Indian Institute of Technology
    Kharagpur
  • Language/Language Pair: Bengali to Hindi, English
    to Bengali
  • Names of the lexical resources that will be built:
      • Transfer Grammar Rules (Bengali to Hindi)
      • Bilingual Dictionary (Bengali to Hindi)
      • Annotated Corpus (Bengali)
  • Domain of the lexical resource: News ??

CEL, IIT Kharagpur
3
Transfer Grammar Rules (Bengali to Hindi)
  • Language Pair: Bengali - Hindi
  • Name of the Lexical Resource:
      • Transfer Grammar Rules (Bengali to Hindi)
  • Final Size of the Lexical Resource:
      • 100 to 200 rules
  • Average Size of such Resources in Other
    Languages
  • Estimation of the Expected Size (PERT Chart)
  • Evaluation Metrics:
      • Overall translation quality (as judged by humans)
      • Completeness (Lexical, Grammatical, Mapping Rule)
      • Correctness (Lexical, Syntactic, Semantic)
      • Stylistics (Lexical, Syntactic, Usage)

4
Bilingual Dictionary (Bengali to Hindi)
  • Language Pair: Bengali and Hindi
  • Name of the Lexical Resource:
      • Bilingual Dictionary (Bengali to Hindi)
  • Final Size of the Lexical Resource:
      • 30,000 root words
  • Average Size of such Resources in Other
    Languages:
      • Usually 20,000 to 50,000 root words
  • Estimation of the Expected Size (PERT Chart)

[PERT chart: 0, 10K, 20K, 30K root words at successive 3-month steps]
  • Evaluation Metrics:
      • Overall translation quality (as judged by humans)
      • Coverage
      • Correctness
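
The slide lists Coverage as a metric without defining it. One common reading is the share of running-text tokens whose root form is a dictionary headword. The sketch below assumes that definition; the function name and the romanized toy entries are illustrative and not taken from the project.

```python
def dictionary_coverage(tokens, dictionary):
    """Return the fraction of tokens that appear as headwords in the dictionary.

    Assumes tokens are already reduced to root forms.
    """
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if tok in dictionary)
    return hits / len(tokens)


# Toy romanized Bengali -> Hindi entries (illustrative only).
toy_dict = {"jol": "paani", "boi": "kitaab", "ghor": "ghar"}
toy_tokens = ["jol", "boi", "akash", "ghor"]

coverage = dictionary_coverage(toy_tokens, toy_dict)
print(f"coverage = {coverage:.2f}")
```

Token-level coverage weights frequent words more heavily than type-level coverage would; which of the two the proposers intended is not stated.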

5
Annotated Corpus (Bengali)
  • Language: Bengali
  • Name of the Lexical Resource:
      • Part-of-Speech, Chunk and Named Entity tagged
        corpus
  • Final Size of the Lexical Resource:
      • 200,000 words
  • Average Size of such Resources in Other
    Languages:
      • Usually 200,000 to 300,000 words
  • Estimation of the Expected Size (PERT Chart)

[PERT chart: four successive 6-month stages]
  • Evaluation Metrics:
      • Correctness
      • Coverage

6
IIT Kharagpur
Components
  • Project proposal
  • Machine Translation

7
Components
  • Title: Machine Translation among Indian
    languages, English to Bengali
  • Proposers: Anupam Basu, Pabitra Mitra, Sudeshna
    Sarkar
  • Institution: IIT Kharagpur
  • Language: Bengali
  • Names of Components that will be implemented:
      • Morphological Synthesis Engine
      • Annotation Standards
  • Components for Bengali:
      • POS Tagger
      • Named Entity Recognizer
      • Local Word Grouper (chunker)
      • Morphological Analyzer
      • Word Generator
      • Sentence Generator
  • Proposed Domain:
      • News ??

8
Morphological Synthesis Engine
  • Language: Generic engine (Horizontal)
  • Name of Component: Morphological Synthesis Engine
  • Techniques Used:
      • Combination of language-specific rules and
        paradigm tables
  • Evaluation Metrics:
      • Usability
      • Coverage
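
A minimal sketch of how paradigm tables and language-specific rules might combine in a synthesis engine. The table layout, the feature names, the single boundary rule and the romanized toy entries are all assumptions made for illustration; the slides do not describe the engine's actual data format.

```python
# Paradigm tables: paradigm name -> {(number, case) -> suffix}.
# Hypothetical, heavily simplified entries in romanized Bengali.
PARADIGMS = {
    "noun_human": {
        ("sg", "nom"): "",
        ("pl", "nom"): "ra",
        ("sg", "gen"): "r",
        ("sg", "obj"): "ke",
    },
}

# Lexicon: root -> paradigm name.
LEXICON = {"chele": "noun_human", "manush": "noun_human"}


def _needs_linking_vowel(root, suffix):
    # Simplified rule: an r-initial suffix after a consonant-final root.
    return suffix.startswith("r") and root[-1] not in "aeiou"


def _add_linking_vowel(root, suffix):
    return root, "e" + suffix


# Language-specific rules applied at the root/suffix boundary,
# each as a (predicate, transform) pair.
RULES = [(_needs_linking_vowel, _add_linking_vowel)]


def synthesize(root, number, case):
    """Look up the suffix in the root's paradigm, then apply boundary rules."""
    suffix = PARADIGMS[LEXICON[root]][(number, case)]
    for applies, transform in RULES:
        if applies(root, suffix):
            root, suffix = transform(root, suffix)
    return root + suffix


print(synthesize("chele", "pl", "nom"))
print(synthesize("manush", "sg", "gen"))
```

Keeping the tables as data and the rules as small functions is one way to keep the engine generic across languages: a new language would supply its own `PARADIGMS`, `LEXICON` and `RULES` without changing `synthesize`.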

9
Annotation Standard
  • Language: All
  • Annotation Standards For:
      • Part-of-Speech, Chunking and Named Entity tags
  • Final Size of the Tag Sets:
      • Part-of-Speech: 24-40, Chunking: 5-20, Named
        Entity: 10-30
  • Average Size of such Tag Sets in Other Languages:
      • Same as above
  • Estimation of the Expected Size (PERT Chart)

All the standard tag sets are to be designed
within the first 2 months.
  • Evaluation Metrics:
      • Usability
      • Coverage

10
POS Tagger for Bengali
  • Language: Bengali
  • Name of Component: Bengali Part-of-Speech Tagger
  • Techniques Used:
      • Bi-gram Hidden Markov Model
      • Semi-supervised learning
      • Morphology-driven transformation-based learning
        for unknown word handling
  • Performance of Techniques in Other Languages:
      • Bi-gram Hidden Markov Model: 97-98% for English
  • Estimate of Expected Performance (PERT Chart)
  • Evaluation Metrics:
      • Sentence/word-level accuracy
      • Known/unknown word accuracy
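
A bi-gram HMM tagger can be sketched as follows: count tag-to-tag transitions and tag-to-word emissions from a tagged corpus, then decode with the Viterbi algorithm. This is a minimal supervised sketch with add-one smoothing; the function names and the English toy corpus are assumptions, and the semi-supervised training and the transformation-based unknown-word handling listed above are not covered.

```python
import math
from collections import Counter

START = "<s>"


def train_bigram_hmm(tagged_sentences):
    """Collect transition and emission counts from (word, tag) sentences."""
    trans, emit = Counter(), Counter()
    from_count, tag_count = Counter(), Counter()
    vocab = set()
    for sent in tagged_sentences:
        prev = START
        for word, tag in sent:
            trans[(prev, tag)] += 1
            from_count[prev] += 1
            emit[(tag, word)] += 1
            tag_count[tag] += 1
            vocab.add(word)
            prev = tag
    return trans, emit, from_count, tag_count, sorted(tag_count), len(vocab)


def _log_prob(count, total, n_outcomes):
    # Add-one smoothing keeps unseen events from getting zero probability.
    return math.log((count + 1) / (total + n_outcomes))


def viterbi(words, model):
    """Return the most probable tag sequence under the bigram HMM."""
    if not words:
        return []
    trans, emit, from_count, tag_count, tags, vocab_size = model
    n_tags, n_words = len(tags), vocab_size + 1  # +1 slot for unknown words
    best = [{}]  # best[i][tag] = (log score, previous tag)
    for t in tags:
        score = (_log_prob(trans[(START, t)], from_count[START], n_tags)
                 + _log_prob(emit[(t, words[0])], tag_count[t], n_words))
        best[0][t] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for t in tags:
            e = _log_prob(emit[(t, words[i])], tag_count[t], n_words)
            best[i][t] = max(
                (best[i - 1][p][0]
                 + _log_prob(trans[(p, t)], from_count[p], n_tags) + e, p)
                for p in tags)
    # Follow back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]


toy_corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]
model = train_bigram_hmm(toy_corpus)
print(viterbi(["the", "cat", "runs"], model))
```

Word-level accuracy would then be the share of positions where the decoded tag matches the gold tag, reported separately for words seen and unseen in training.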

11
Named Entity recognizer for Bengali
  • Language: Bengali
  • Name of Component: Bengali Named Entity
    Recognizer
  • Techniques Used:
      • Maximum Entropy model, Conditional Random Field
  • Performance of Techniques in Other Languages:
      • Precision: 90-95% for English
  • Estimate of Expected Performance (PERT Chart)
  • Evaluation Metrics:
      • Precision and Recall
      • F-measure

[PERT chart: expected performance 60, 70, 75, 80 over four 3-month steps, then 90 at the 6-month step]
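
Precision, recall and F-measure for entity recognition can be computed as below. The sketch assumes exact-match scoring over (start, end, type) spans, which is one common convention; the slides do not say which matching scheme the project uses, and the sample spans are invented for illustration.

```python
def precision_recall_f1(gold, predicted):
    """Exact-match precision, recall and F1 over two collections of entity spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Entities as (start_token, end_token, type); illustrative values only.
gold_spans = [(0, 2, "PER"), (5, 6, "LOC"), (8, 9, "ORG"), (11, 12, "LOC")]
pred_spans = [(0, 2, "PER"), (5, 6, "LOC"), (8, 9, "PER")]

p, r, f = precision_recall_f1(gold_spans, pred_spans)
print(f"P={p:.3f} R={r:.3f} F1={f:.3f}")
```

The third prediction has the right boundaries but the wrong type, so under exact matching it counts as both a false positive and a missed gold entity.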
12
Bengali Local Word Grouper
  • Language: Bengali
  • Name of Component: Bengali Local Word Grouper
  • Techniques Used:
      • Feature structure unification using a greedy
        algorithm
      • Statistical chunking
      • MWE handling
  • Performance of Techniques in Other Languages:
      • LWG accuracy: 90-95% for English
  • Estimate of Expected Performance (PERT Chart)
  • Evaluation Metrics:
      • F-Score: harmonic mean of Precision and Recall
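
One way to read "feature structure unification using a greedy algorithm" is a left-to-right pass that keeps extending the current group for as long as the next word's features unify with the group's. The sketch below assumes flat (non-nested) feature structures and a hypothetical feature inventory; it does not cover the statistical chunking or MWE handling also listed.

```python
def unify(fs_a, fs_b):
    """Unify two flat feature structures.

    Returns the merged structure, or None if a shared feature has conflicting values.
    """
    merged = dict(fs_a)
    for key, value in fs_b.items():
        if key in merged and merged[key] != value:
            return None
        merged[key] = value
    return merged


def greedy_group(tokens):
    """Group (word, features) pairs greedily from left to right.

    A word joins the current group when its features unify with the
    group's accumulated features; otherwise it starts a new group.
    """
    groups = []
    for word, features in tokens:
        if groups:
            words, group_features = groups[-1]
            merged = unify(group_features, features)
            if merged is not None:
                groups[-1] = (words + [word], merged)
                continue
        groups.append(([word], dict(features)))
    return groups


# Placeholder words with hypothetical features.
toy_tokens = [
    ("w1", {"head": "noun", "num": "sg"}),
    ("w2", {"head": "noun"}),
    ("w3", {"head": "verb"}),
    ("w4", {"head": "verb", "tense": "past"}),
]
grouped = greedy_group(toy_tokens)
print([words for words, _ in grouped])
```

Because the pass never revisits an earlier decision, it is fast but can commit to a grouping that a later word would have argued against, which is the usual trade-off of a greedy strategy.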

13
Morphological Analyzer for Bengali
  • Language: Bengali
  • Name of Component: Bengali Morphological Analyzer
  • Techniques Used:
      • Backward traversal of a word along a DAG
        structure
  • Performance of Techniques in Other Languages:
      • Morphological analyzer accuracy: 97-98% for
        English
  • Estimate of Expected Coverage (PERT Chart)
  • Evaluation Metrics:
      • Coverage
      • Correctness

[PERT chart: expected coverage 0, 90, 95, 97 over three 3-month steps]
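
Backward traversal can be illustrated with a trie over reversed suffixes: the word is read from its last character toward its first, and every time a complete suffix is reached the remaining prefix is checked against a root lexicon. A trie is a tree, which is a special case of a DAG; the project's actual DAG layout is not described in the slides, and the suffix labels and romanized entries below are illustrative assumptions.

```python
END = object()  # sentinel key marking the end of a complete suffix


def build_suffix_trie(suffixes):
    """Build a trie keyed by suffix characters in reverse order."""
    root = {}
    for suffix, label in suffixes.items():
        node = root
        for ch in reversed(suffix):
            node = node.setdefault(ch, {})
        node[END] = label
    return root


def analyze(word, trie, lexicon):
    """Return (root, label) analyses found by walking the word backward."""
    analyses = []
    if word in lexicon:
        analyses.append((word, "bare"))
    node = trie
    # Stop before index 0 so that the candidate root is never empty.
    for i in range(len(word) - 1, 0, -1):
        node = node.get(word[i])
        if node is None:
            break
        stem = word[:i]
        if END in node and stem in lexicon:
            analyses.append((stem, node[END]))
    return analyses


# Toy romanized Bengali suffixes and roots (illustrative only).
SUFFIXES = {"ra": "plural nominative", "r": "genitive", "ke": "objective"}
ROOTS = {"chele"}
TRIE = build_suffix_trie(SUFFIXES)

for form in ("chele", "cheler", "chelera", "cheleke"):
    print(form, analyze(form, TRIE, ROOTS))
```

Sharing nodes between suffixes with a common ending is what would turn this trie into a more compact DAG.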
14
Word Generator for Bengali
  • Language: Bengali
  • Name of Component: Bengali Word Generator
  • Techniques Used:
      • Combination of rules and paradigm tables
  • Performance of Techniques in Other Languages:
      • NA
  • Estimate of Expected Coverage (PERT Chart)
  • Evaluation Metrics:
      • Understandability
      • Quality
      • Completeness

15
Transliteration (Hindi-Bengali)
  • Language: Bengali to Hindi, Hindi to Bengali
  • Name of Component: Transliteration
  • Techniques Used:
      • Character trigram substitution
      • Grapheme-to-phoneme mapper
  • Evaluation Metrics:
      • Usability
      • Coverage
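
A rough baseline for Bengali-to-Devanagari script conversion, assuming one possible reading of "character trigram substitution": the replacement for each character may depend on its left and right neighbours, with a context-free default otherwise. The default used here is a swapped-in simplification, not the project's method: the Unicode Bengali block (U+0980 to U+09FF) and Devanagari block (U+0900 to U+097F) share a parallel layout, so most characters correspond at a fixed offset. The `overrides` table and the `#` boundary marker are hypothetical.

```python
BENGALI_START, BENGALI_END = 0x0980, 0x09FF
OFFSET = 0x0980 - 0x0900  # distance between the two Unicode blocks
BOUNDARY = "#"            # hypothetical padding symbol for word edges


def bengali_to_devanagari(text, overrides=None):
    """Convert Bengali script to Devanagari character by character.

    overrides maps a 3-character context (previous, current, next) to the
    replacement for the middle character. Without a matching override,
    Bengali-block characters are shifted by the block offset and all other
    characters pass through unchanged.
    """
    overrides = overrides or {}
    padded = BOUNDARY + text + BOUNDARY
    out = []
    for i, ch in enumerate(text):
        trigram = padded[i:i + 3]  # centred on ch
        if trigram in overrides:
            out.append(overrides[trigram])
        elif BENGALI_START <= ord(ch) <= BENGALI_END:
            out.append(chr(ord(ch) - OFFSET))
        else:
            out.append(ch)
    return "".join(out)


sample = "বাংলা"
converted = bengali_to_devanagari(sample)
print(converted)
```

The result is a code-point-level conversion and will not always match conventional Hindi spelling or pronunciation, and a few Bengali code points have no true Devanagari counterpart at the shifted position. Gaps of this kind are what context rules and a grapheme-to-phoneme step would be expected to address.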

16
Sentence Generator for Bengali
  • Language: Bengali
  • Name of Component: Sentence Generator for Bengali
  • Techniques Used:
      • Grammar rule based
  • Evaluation Metrics:
      • Translation quality (as judged by native speakers)
      • Understandability
      • Stylistics