Cognate or False Friend? Ask the Web!


1
Cognate or False Friend? Ask the Web!
A Workshop on Acquisition and Management of
Multilingual Lexicons

Svetlin Nakov, Sofia University "St. Kliment
Ohridski"
Preslav Nakov, University of California, Berkeley
Elena Paskaleva, Bulgarian Academy of Sciences

2
Introduction

Cognates and false friends
Cognates are pairs of words in different languages
that sound similar and are translations of each
other
False friends are pairs of words in two languages
that sound similar but differ in their meanings
The problem
Design an algorithm that can distinguish between
cognates and false friends

3
Cognates and False Friends

Examples of cognates
ден in Bulgarian = день in Russian (day)
idea in English = идея in Bulgarian (idea)
Examples of false friends
майка in Bulgarian (mother) ≠ майка in Russian
(vest)
prost in German (cheers) ≠ прост in Bulgarian
(stupid)
gift in German (poison) ≠ gift in English
(present)

4
The Paper in One Slide

Measuring semantic similarity
Analyze the words' local contexts
Use the Web as a corpus
Similar contexts → similar words
Context translation → cross-lingual similarity
Evaluation
200 pairs of words
100 cognates and 100 false friends
11-pt average precision: 95.84%

5
Contextual Web Similarity

What is local context?
A few words before and after the target word
The words in the local context of a given word are
semantically related to it
Need to exclude the stop words: prepositions,
pronouns, conjunctions, etc.
Stop words appear in all contexts
Need a sufficiently large corpus

Same day delivery of fresh flowers, roses, and
unique gift baskets from our online boutique.
Flower delivery online by local florists for
birthday flowers.
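
The local-context idea above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function name `local_context`, the tiny stop-word set, and the choice to take the window over raw tokens before dropping stop words are all assumptions.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list; the real lists hold hundreds of words.
STOP_WORDS = {"a", "an", "and", "by", "for", "from", "of", "our", "the", "to"}

def local_context(text, target, size=3, stop_words=STOP_WORDS):
    """Count the words found within `size` tokens of each occurrence of `target`."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != target:
            continue
        # Assumption: the window is cut from the raw tokens, then filtered.
        window = tokens[max(0, i - size):i] + tokens[i + 1:i + 1 + size]
        counts.update(w for w in window if w not in stop_words and w != target)
    return counts

snippet = ("Same day delivery of fresh flowers, roses, and unique gift "
           "baskets from our online boutique.")
context = local_context(snippet, "flowers")
```

For the first excerpt above and a context size of 3, the stop words "of" and "and" are dropped and four context words remain, each with frequency 1.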
6
Contextual Web Similarity

Web as a corpus
The Web can be used as a corpus to extract the
local context for a given word
The Web is the largest possible corpus
Contains big corpora in any language
Searching for a word in Google can return up to
1 000 text excerpts
The target word is given along with its local
context: a few words before and after it
Target language can be specified

7
Contextual Web Similarity

Web as a corpus
Example Google query for "flower"

8
Contextual Web Similarity

Measuring semantic similarity
For two given words, their local contexts are
extracted from the Web
A set of words and their frequencies
Semantic similarity is measured as the similarity
between these local contexts
Local contexts are represented as frequency
vectors over a given set of words
The cosine between the frequency vectors in
Euclidean space is calculated
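
A minimal sketch of the cosine computation, assuming each local context is stored as a word-to-frequency dictionary. The function name `cosine` and the example counts are made up for illustration.

```python
import math

def cosine(freq_a, freq_b):
    """Cosine of the angle between two sparse frequency vectors (dicts)."""
    dot = sum(value * freq_b.get(word, 0) for word, value in freq_a.items())
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty context is treated as unrelated to everything
    return dot / (norm_a * norm_b)

# Made-up frequencies, for illustration only.
flower = {"delivery": 3, "garden": 4}
computer = {"software": 5}
unrelated = cosine(flower, computer)  # the contexts share no words
identical = cosine(flower, flower)
```

Words missing from one of the dictionaries count as frequency 0, so the two vectors do not need to be aligned beforehand.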

9
Contextual Web Similarity

Example of context word frequencies

[Tables: context word frequencies for the words "flower" and "computer"]
10
Contextual Web Similarity

Example of frequency vectors
Similarity = cosine(v1, v2)

v1: flower
v2: computer
11
Cross-Lingual Similarity

We are given two words in different languages, L1
and L2
We have a bilingual glossary G of translation
pairs (p, q), p ∈ L1, q ∈ L2
Measuring cross-lingual similarity
We extract the local contexts of the target words
from the Web: C1 (in L1) and C2 (in L2)
We translate the context
We measure the distance between C1 and C2
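
The steps above might look as follows in Python. This is a sketch under stated assumptions: the L1 context is the one translated, context words missing from the glossary are dropped, and cosine is used to compare the contexts. The glossary entries and counts are illustrative only.

```python
import math
from collections import Counter

def translate_context(context, glossary):
    """Map L1 context words to L2; words not in the glossary are dropped (assumption)."""
    translated = Counter()
    for word, freq in context.items():
        if word in glossary:
            translated[glossary[word]] += freq
    return translated

def cosine(freq_a, freq_b):
    dot = sum(v * freq_b.get(w, 0) for w, v in freq_a.items())
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cross_lingual_similarity(c1, c2, glossary):
    """Translate the L1 context into L2, then compare it with the L2 context."""
    return cosine(translate_context(c1, glossary), c2)

# Illustrative Bulgarian -> Russian glossary and made-up context counts.
glossary = {"градина": "сад", "вода": "вода", "слънце": "солнце"}
c1 = {"градина": 2, "вода": 1, "онлайн": 5}   # Bulgarian context
c2 = {"сад": 2, "вода": 1}                     # Russian context
similarity = cross_lingual_similarity(c1, c2, glossary)
```

Here "онлайн" has no glossary entry and is ignored, so the translated context matches the Russian one and the similarity is close to 1.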

12
Reverse Context Lookup

Local context extracted from the Web can contain
arbitrary parasite words like "online", "home",
"search", "click", etc.
Internet terms appear in any Web page
Such words are not likely to be associated with
the target word
Example (for the word flowers)
"send flowers online", "flowers here", "order
flowers here"
Will the word "flowers" appear in the local
context of "send", "online" and "here"?

13
Reverse Context Lookup

If two words are semantically related, both should
appear in the local contexts of each other
Let #(x, y) be the number of occurrences of x in the
local context of y
For any word w and a word wc from its local context,
we define their strength of semantic
association p(w, wc) as follows:
p(w, wc) = min{ #(w, wc), #(wc, w) }
We use p(w, wc) as vector coordinates when
measuring semantic similarity
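
A small sketch of reverse context lookup, assuming the local contexts of all relevant words have already been collected into a nested dictionary where `contexts[y][x]` is the number of occurrences of x in the local context of y. The function names and all counts are hypothetical.

```python
def association_strength(contexts, w, wc):
    """p(w, wc): the smaller of the two directed co-occurrence counts."""
    w_in_context_of_wc = contexts.get(wc, {}).get(w, 0)
    wc_in_context_of_w = contexts.get(w, {}).get(wc, 0)
    return min(w_in_context_of_wc, wc_in_context_of_w)

def reverse_lookup_vector(contexts, w):
    """Replace the raw frequencies in the context of w with p(w, wc)."""
    return {wc: association_strength(contexts, w, wc)
            for wc in contexts.get(w, {})}

# Made-up counts, for illustration only.
contexts = {
    "flowers": {"send": 4, "online": 6, "roses": 5},
    "send": {"flowers": 3, "email": 9},
    "online": {"shop": 7},
    "roses": {"flowers": 8},
}
vector = reverse_lookup_vector(contexts, "flowers")
```

In this toy data "online" is frequent near "flowers", but "flowers" never appears near "online", so its coordinate drops to 0, while "send" and "roses" keep a non-zero weight.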

14
Web Similarity Using Seed Words

Adaptation of the Fung & Yee '98 algorithm
We have a bilingual glossary G (L1 to L2) of
translation pairs and target words w1, w2
We search Google for the co-occurrences of the
target words with the glossary entries
Compare the co-occurrence vectors
for each (p, q) ∈ G compare
max(google("w1 p"), google("p w1"))
with
max(google("w2 q"), google("q w2"))

P. Fung and L. Y. Yee. An IR approach for
translating from nonparallel, comparable texts.
In Proceedings of ACL, volume 1, pages 414–420,
1998.
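
The seed-word comparison can be sketched as below. The slides do not say how the two co-occurrence vectors are compared, so the use of cosine here is an assumption. The `hits` callable stands in for a search-engine page-hit lookup and is hypothetical, as are the placeholder words and counts.

```python
import math

def seed_vector(word, seeds, hits):
    """One coordinate per seed: the larger hit count of the two word orders."""
    return [max(hits(f"{word} {seed}"), hits(f"{seed} {word}")) for seed in seeds]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def seed_similarity(w1, w2, glossary, hits):
    """Compare w1 (in L1) with w2 (in L2) through the glossary pairs (p, q)."""
    v1 = seed_vector(w1, [p for p, _ in glossary], hits)
    v2 = seed_vector(w2, [q for _, q in glossary], hits)
    return cosine(v1, v2)

# Stub in place of real search-engine hit counts (made-up numbers).
fake_counts = {"w1 p1": 10, "p1 w1": 4, "p2 w1": 2,
               "w2 q1": 5, "q2 w2": 1}
hits = lambda phrase: fake_counts.get(phrase, 0)
glossary = [("p1", "q1"), ("p2", "q2")]
similarity = seed_similarity("w1", "w2", glossary, hits)
```

Because the glossary aligns each p with its translation q, the i-th coordinates of the two vectors refer to the same concept, which is what makes them comparable across languages.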
15
Evaluation Data Set

We use 200 Bulgarian/Russian pairs of words
100 cognates and 100 false friends
Manually assembled by a linguist
Manually checked in several large monolingual and
bilingual dictionaries
Limited to nouns only

16
Experiments

We tested a few modifications of our contextual Web
similarity algorithm
Use of TF.IDF weighting
Preserve the stop words
Use of lemmatization of the context words
Use of different context sizes (2, 3, 4 and 5)
Use of a small and a large bilingual glossary
Compared it with the seed words algorithm
Compared it with traditional orthographic similarity
measures: LCSR and MEDR

17
Experiments

BASELINE: random
MEDR: minimum edit distance ratio
LCSR: longest common subsequence ratio
SEED: the "seed words" algorithm
WEB3: the Web-based similarity algorithm with the
default parameters: context size 3, small
glossary, stop words filtering, no lemmatization,
no reverse context lookup, no TF.IDF weighting
NO-STOP: WEB3 without stop words removal
WEB1, WEB2, WEB4 and WEB5: WEB3 with context size
of 1, 2, 4 and 5
LEMMA: WEB3 with lemmatization
HUGEDICT: WEB3 with the huge glossary
REVERSE: the "reverse context lookup" algorithm
COMBINED: WEB3 + lemmatization + huge glossary +
reverse context lookup
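
The two orthographic baselines can be sketched as follows. The slides only name them, so the formulas below are the commonly used definitions and should be read as an assumption: LCSR is the length of the longest common subsequence divided by the length of the longer word, and MEDR is 1 minus the Levenshtein distance divided by the length of the longer word.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ch_a == ch_b
                        else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def lcsr(a, b):
    return lcs_length(a, b) / max(len(a), len(b))

def medr(a, b):
    return 1 - edit_distance(a, b) / max(len(a), len(b))

pair_lcsr = lcsr("ден", "день")
pair_medr = medr("ден", "день")
```

Both measures look only at spelling, so a false-friend pair with identical spelling scores 1.0 on each, which is why a semantic measure is needed to tell such pairs from cognates.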

18
Resources

We used the following resources
Bilingual Bulgarian / Russian glossary: 3 794
pairs of translation words
Huge bilingual glossary: 59 583 word pairs
A list of 599 Bulgarian stop words
A list of 508 Russian stop words
Bulgarian lemma dictionary: 1 000 000 wordforms
and 70 000 lemmata
Russian lemma dictionary: 1 500 000 wordforms and
100 000 lemmata

19
Evaluation

We order the pairs of words from the testing
dataset by the calculated similarity
False friends are expected to appear at the top
and the cognates at the bottom
We evaluate the 11-pt average precision of the
obtained ordering
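
A sketch of 11-point interpolated average precision over such an ordering. It assumes the usual definition (the mean of the interpolated precision at recall levels 0.0, 0.1, ..., 1.0) and treats false friends as the positive class; the function name and the toy ranking are illustrative.

```python
def eleven_point_average_precision(ranked_labels):
    """ranked_labels: booleans in rank order, True = false friend (positive)."""
    total_positives = sum(ranked_labels)
    if total_positives == 0:
        raise ValueError("the ranking contains no positive examples")

    # (recall, precision) measured right after each positive in the ranking.
    points = []
    found = 0
    for rank, is_positive in enumerate(ranked_labels, start=1):
        if is_positive:
            found += 1
            points.append((found / total_positives, found / rank))

    # Interpolated precision at level r: best precision at any recall >= r.
    interpolated = []
    for i in range(11):
        level = i / 10
        reachable = [p for r, p in points if r >= level - 1e-12]
        interpolated.append(max(reachable, default=0.0))
    return sum(interpolated) / 11

perfect = eleven_point_average_precision([True, True, False, False])
mixed = eleven_point_average_precision([True, False, True, False])
```

To obtain `ranked_labels` from similarity scores, the pairs would be sorted by ascending similarity, so that the least similar pairs (the expected false friends) come first.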

20
Results (11pt Average Precision)
Comparing BASELINE, LCSR, MEDR, SEED and WEB3
algorithms
21
Results (11pt Average Precision)
Comparing different context sizes, keeping the
stop words
22
Results (11pt Average Precision)
Comparing different improvements of the WEB3
algorithm
23
Results (Precision-Recall Graph)
Comparing the recall-precision graphs of
the evaluated algorithms
24
Results: The Ordering for WEB3
25
Discussion

Our approach is original because it
Introduces a semantic similarity measure
Not orthographic or phonetic
Uses the Web as a corpus
Does not rely on any preexisting corpora
Uses reverse context lookup
Significant improvement in quality
Is applied to an original problem
Classification of almost identically spelled
true/false friends

26
Discussion

Very good accuracy: over 95%
It is not 100% accurate
Typical mistakes are synonyms, hyponyms, words
influenced by cultural, historical and
geographical differences
The Web as a corpus introduces noise
Google returns the first 1 000 results only
Google ranks news portals, travel agencies
and retail sites higher than books, articles and
forum posts
Local context can contain noise

27
Conclusion and Future Work