1
Cognate or False Friend? Ask the Web!
A Workshop on Acquisition and Management of
Multilingual Lexicons
  • Svetlin Nakov, Sofia University "St. Kliment
    Ohridski"
  • Preslav Nakov, University of California, Berkeley
  • Elena Paskaleva, Bulgarian Academy of Sciences

2
Introduction
  • Cognates and false friends
  • Cognates are pairs of words in different
    languages that sound similar and are translations
    of each other
  • False friends are pairs of words in two languages
    that sound similar but differ in their meanings
  • The problem
  • Design an algorithm that can distinguish between
    cognates and false friends

3
Cognates and False Friends
  • Examples of cognates
  • ден in Bulgarian – день in Russian (day)
  • idea in English – идея in Bulgarian (idea)
  • Examples of false friends
  • майка in Bulgarian (mother) ≠ майка in Russian
    (vest)
  • prost in German (cheers) ≠ прост in Bulgarian
    (stupid)
  • gift in German (poison) ≠ gift in English
    (present)

4
The Paper in One Slide
  • Measuring semantic similarity
  • Analyze the words' local contexts
  • Use the Web as a corpus
  • Similar contexts → similar words
  • Context translation → cross-lingual similarity
  • Evaluation
  • 200 pairs of words
  • 100 cognates and 100 false friends
  • 11pt average precision: 95.84%

5
Contextual Web Similarity
  • What is local context?
  • Few words before and after the target word
  • The words in the local context of a given word
    are semantically related to it
  • Need to exclude the stop words: prepositions,
    pronouns, conjunctions, etc.
  • Stop words appear in all contexts
  • Need for a sufficiently big corpus

Same day delivery of fresh flowers, roses, and
unique gift baskets from our online boutique.
Flower delivery online by local florists for
birthday flowers.
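The local-context idea above can be sketched in code. This is a minimal illustration, not the authors' implementation: the window size, the tokenizer, and the tiny stop-word list are assumptions for the example.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list (the paper uses real BG/RU lists).
STOP_WORDS = {"of", "and", "from", "our", "by", "for", "the", "a"}

def local_context(text, target, size=3):
    """Count the words within `size` positions of each occurrence
    of `target`, skipping stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    ctx = Counter()
    for i, w in enumerate(words):
        if w == target:
            window = words[max(0, i - size):i] + words[i + 1:i + 1 + size]
            ctx.update(w2 for w2 in window if w2 not in STOP_WORDS)
    return ctx

# The excerpt from the slide above:
snippet = ("Same day delivery of fresh flowers, roses, and unique gift "
           "baskets from our online boutique. Flower delivery online by "
           "local florists for birthday flowers.")
print(local_context(snippet, "flowers"))
```

Running this keeps content words such as "delivery" and "roses" while the stop words "of" and "and" inside the windows are discarded.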
6
Contextual Web Similarity
  • Web as a corpus
  • The Web can be used as a corpus to extract the
    local context of a given word
  • The Web is the largest possible corpus
  • Contains big corpora in any language
  • Searching for a word in Google can return up to
    1,000 text excerpts
  • The target word is returned along with its local
    context: a few words before and after it
  • The target language can be specified

7
Contextual Web Similarity
  • Web as a corpus
  • Example Google query for "flower"

8
Contextual Web Similarity
  • Measuring semantic similarity
  • For two given words, their local contexts are
    extracted from the Web
  • A set of words and their frequencies
  • Semantic similarity is measured as the similarity
    between these local contexts
  • Local contexts are represented as frequency
    vectors over a given set of words
  • The cosine between the frequency vectors in
    Euclidean space is calculated
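The cosine comparison of two sparse frequency vectors can be sketched as follows; the example counts are invented for illustration:

```python
import math

def cosine(v1, v2):
    """Cosine between two sparse frequency vectors (dicts word -> count)."""
    common = set(v1) & set(v2)
    dot = sum(v1[w] * v2[w] for w in common)
    n1 = math.sqrt(sum(c * c for c in v1.values()))
    n2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# Hypothetical context frequencies for two unrelated words:
flower   = {"delivery": 11, "gift": 9, "online": 7, "fresh": 6}
computer = {"online": 12, "software": 10, "repair": 4}
print(cosine(flower, computer))  # low value: contexts barely overlap
```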

9
Contextual Web Similarity
  • Example of context words frequencies

[Tables: context word frequencies for the words "flower" and "computer"]
10
Contextual Web Similarity
  • Example of frequency vectors
  • Similarity = cosine(v1, v2)

[Tables: frequency vectors v1 for "flower" and v2 for "computer"]
11
Cross-Lingual Similarity
  • We are given two words in different languages L1
    and L2
  • We have a bilingual glossary G of translation
    pairs (p, q) with p ∈ L1, q ∈ L2
  • Measuring cross-lingual similarity
  • We extract the local contexts of the target words
    from the Web: C1 in L1 and C2 in L2
  • We translate the context C2 into L1 using the
    glossary
  • We measure the similarity between C1 and the
    translated C2
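The translation step can be sketched as follows, assuming a toy Russian→Bulgarian glossary (the entries are hypothetical); the translated context vector would then be compared to C1 exactly as in the monolingual case:

```python
def translate_context(ctx, glossary):
    """Project an L2 context vector into L1 via the glossary;
    words without a glossary entry are dropped."""
    out = {}
    for word, count in ctx.items():
        if word in glossary:
            t = glossary[word]
            out[t] = out.get(t, 0) + count
    return out

# Hypothetical Russian -> Bulgarian glossary entries:
glossary = {"цветок": "цвете", "доставка": "доставка"}

# A toy Russian context vector; "сайт" has no glossary entry.
ctx_ru = {"цветок": 5, "доставка": 3, "сайт": 2}
print(translate_context(ctx_ru, glossary))
```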

12
Reverse Context Lookup
  • Local context extracted from the Web can contain
    arbitrary parasite words like "online", "home",
    "search", "click", etc.
  • Internet terms appear in any Web page
  • Such words are not likely to be associated with
    the target word
  • Example (for the word flowers)
  • "send flowers online", "flowers here", "order
    flowers here"
  • Will the word "flowers" appear in the local
    context of "send", "online" and "here"?

13
Reverse Context Lookup
  • If two words are semantically related both should
    appear in the local contexts of each other
  • Let #(x, y) = the number of occurrences of x in
    the local context of y
  • For any word w and a word wc from its local
    context, we define their strength of semantic
    association p(w, wc) as follows:
  • p(w, wc) = min(#(w, wc), #(wc, w))
  • We use p(w, wc) as vector coordinates when
    measuring semantic similarity
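A minimal sketch of this association strength; the co-occurrence counts below are hypothetical:

```python
def association_strength(cooc, w, wc):
    """p(w, wc) = min(#(w, wc), #(wc, w)): a word and a context word
    count as associated only if each appears in the other's context."""
    return min(cooc.get((w, wc), 0), cooc.get((wc, w), 0))

# Hypothetical counts #(x, y): occurrences of x in the context of y.
cooc = {
    ("flowers", "roses"): 12, ("roses", "flowers"): 9,   # mutual
    ("flowers", "online"): 25, ("online", "flowers"): 0, # parasite word
}
print(association_strength(cooc, "flowers", "roses"))   # 9
print(association_strength(cooc, "flowers", "online"))  # 0
```

The one-sided parasite word "online" gets weight 0, which is exactly the filtering effect the reverse lookup is after.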

14
Web Similarity Using Seed Words
  • Adaptation of the Fung & Yee '98 algorithm
  • We have a bilingual glossary G ⊂ L1 × L2 of
    translation pairs and target words w1, w2
  • We search in Google for co-occurrences of the
    target words with the glossary entries
  • Compare the co-occurrence vectors:
  • for each (p, q) ∈ G compare
  • max(google("w1 p"), google("p w1"))
  • with
  • max(google("w2 q"), google("q w2"))

P. Fung and L. Y. Yee. An IR approach for
translating from nonparallel, comparable texts.
In Proceedings of ACL, volume 1, pages 414-420,
1998.
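The co-occurrence vector of the seed-words approach can be sketched as follows, with a hypothetical `hits(phrase)` function standing in for a Google exact-phrase hit count:

```python
def cooccurrence_vector(word, seeds, hits):
    """For each seed word, take the larger hit count of the two phrase
    orders, as in max(google("w p"), google("p w"))."""
    return [max(hits(f"{word} {seed}"), hits(f"{seed} {word}"))
            for seed in seeds]

# Invented hit counts standing in for Google results:
fake_counts = {"майка ден": 4, "ден майка": 7,
               "майка цвете": 2, "цвете майка": 1}
hits = lambda phrase: fake_counts.get(phrase, 0)

# Vector for the Bulgarian word "майка" over two toy seed words:
print(cooccurrence_vector("майка", ["ден", "цвете"], hits))  # [7, 2]
```

The vectors for w1 (over the L1 sides of G) and w2 (over the L2 sides) are then compared, e.g. by cosine, as in the contextual algorithm.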
15
Evaluation Data Set
  • We use 200 Bulgarian/Russian pairs of words
  • 100 cognates and 100 false friends
  • Manually assembled by a linguist
  • Manually checked in several large monolingual and
    bilingual dictionaries
  • Limited to nouns only

16
Experiments
  • We tested a few modifications of our contextual
    Web similarity algorithm
  • Use of TF.IDF weighting
  • Preserving the stop words
  • Use of lemmatization of the context words
  • Use of different context sizes (2, 3, 4 and 5)
  • Use of small and large bilingual glossaries
  • Compared it with the seed words algorithm
  • Compared with traditional orthographic similarity
    measures: LCSR and MEDR

17
Experiments
  • BASELINE: random
  • MEDR: minimum edit distance ratio
  • LCSR: longest common subsequence ratio
  • SEED: the "seed words" algorithm
  • WEB3: the Web-based similarity algorithm with the
    default parameters: context size 3, small
    glossary, stop-word filtering, no lemmatization,
    no reverse context lookup, no TF.IDF weighting
  • NO-STOP: WEB3 without stop-word removal
  • WEB1, WEB2, WEB4 and WEB5: WEB3 with context
    sizes of 1, 2, 4 and 5
  • LEMMA: WEB3 with lemmatization
  • HUGEDICT: WEB3 with the huge glossary
  • REVERSE: the "reverse context lookup" algorithm
  • COMBINED: WEB3 + lemmatization + huge glossary +
    reverse context lookup

18
Resources
  • We used the following resources:
  • Bilingual Bulgarian/Russian glossary: 3,794
    translation pairs
  • Huge bilingual glossary: 59,583 word pairs
  • A list of 599 Bulgarian stop words
  • A list of 508 Russian stop words
  • Bulgarian lemma dictionary: 1,000,000 wordforms
    and 70,000 lemmata
  • Russian lemma dictionary: 1,500,000 wordforms and
    100,000 lemmata

19
Evaluation
  • We order the pairs of words from the test dataset
    by the calculated similarity
  • False friends are expected to appear at the top
    and the cognates at the bottom
  • We evaluate the 11pt average precision of the
    resulting ordering
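The 11pt average precision over such a ranking can be sketched as follows; the label sequence in the example is invented (1 marks a false friend, ranked from most to least similar-looking):

```python
def avg_precision_11pt(ranked_labels):
    """Mean of the maximum precision attainable at recall levels
    0.0, 0.1, ..., 1.0 over a ranked list of binary labels."""
    total_pos = sum(ranked_labels)
    hits = 0
    precisions, recalls = [], []
    for i, lab in enumerate(ranked_labels, 1):
        hits += lab
        precisions.append(hits / i)
        recalls.append(hits / total_pos)
    points = []
    for t in (i / 10 for i in range(11)):
        ps = [p for p, r in zip(precisions, recalls) if r >= t]
        points.append(max(ps) if ps else 0.0)
    return sum(points) / 11

print(avg_precision_11pt([1, 1, 0, 1, 0, 0]))  # imperfect ordering
```

A perfect ordering (all positives first) scores 1.0; interleaving positives and negatives pulls the score down.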

20
Results (11pt Average Precision)
Comparing BASELINE, LCSR, MEDR, SEED and WEB3
algorithms
21
Results (11pt Average Precision)
Comparing different context sizes keeping the
stop words
22
Results (11pt Average Precision)
Comparing different improvements of the WEB3
algorithm
23
Results (Precision-Recall Graph)
Comparing the recall-precision graphs of
evaluated algorithms
24
Results: The Ordering for WEB3
25
Discussion
  • Our approach is original because it
  • Introduces a semantic similarity measure
  • Not orthographic or phonetic
  • Uses the Web as a corpus
  • Does not rely on any preexisting corpora
  • Uses reverse-context lookup
  • Significant improvement in quality
  • Is applied to an original problem
  • Classification of almost identically spelled
    true/false friends

26
Discussion
  • Very good accuracy: over 95%
  • It is not 100% accurate
  • Typical mistakes: synonyms, hyponyms, and words
    influenced by cultural, historical and
    geographical differences
  • The Web as a corpus introduces noise
  • Google returns only the first 1,000 results
  • Google ranks news portals, travel agencies and
    retail sites higher than books, articles and
    forum posts
  • The local context can contain noise

27
Conclusion and Future Work
  • Conclusion
  • Algorithm that can distinguish between cognates
    and false friends
  • Analyzes words' local contexts, using the Web as
    a corpus
  • Future Work
  • Better glossaries
  • Automatically augmenting the glossary
  • Different language pairs

28
Questions?
Cognate or False Friend? Ask the Web!