On Ambiguity in Internet Searches

1
On Ambiguity in Internet Searches
  • A collaboration between the University of Oslo
    and Fast Search and Transfer
  • Conducted by Tekstlaboratoriet, University of
    Oslo
  • July and August 2000

2
The People Involved
  • Gordana Ilic Holen
  • Janne von Koss Torkildsen
  • Janne Bondi Johannessen
  • g.i.holen, janneto, jannebj_at_mail.hf.uio.no

3
Prospect
  • To what extent does search on the Internet give
    irrelevant information?
  • To what extent is this due to semantically
    ambiguous search words?
  • Do semantically ambiguous search words coincide
    with differences in grammatical category?
  • If so, could a disambiguating tagger solve
    ambiguity problems?

4
Method
  • Studied log files from search words in Fast's
    search engine
  • Decided whether each word was ambiguous or not
  • If a search word was ambiguous, its meanings
    were sorted according to several different
    criteria
  • Reconstructed the searches

5
Results
  • Almost one fourth (23.16%) of the most frequent
    search words were ambiguous
  • In 90% of these cases there was a correlation
    between meaning and grammatical category

6
Material: The Log Files
  • 900 000 search words from Fast Search and
    Transfer's log files coming from several
    different search engines
  • The words were sorted by frequency, and identical
    word forms were grouped together
  • The 5500 most frequent words were picked for
    further investigation
  • This way we made sure to work only on common
    search words
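
The counting step described above can be sketched in a few lines of Python. This is only an illustration, not the authors' actual tooling: the function name `top_search_words` and the miniature log are assumptions, while the study itself used about 900 000 log entries and kept the 5500 most frequent words.

```python
from collections import Counter


def top_search_words(log_lines, n):
    """Group identical word forms and return the n most frequent ones."""
    counts = Counter(line.strip() for line in log_lines if line.strip())
    return counts.most_common(n)


# Hypothetical miniature log, for illustration only.
sample_log = ["sex", "chat", "sex", "rose", "chat", "sex", "Liv", "liv"]
top = top_search_words(sample_log, 3)
print(top)
```

Note that Liv and liv are counted as separate word forms here, just as the log files keep them apart.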

7
Material: Number of Lexemes
  • The number of lexemes is much smaller than the
    number of search words because:
  • The files distinguish between capital and
    non-capital letters (e.g. Liv and liv)
  • In several cases many different spellings are
    used for the same word (e.g. pokemon, Pokemon,
    pokèmon, pokémon, Pokèmon)
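
One way to collapse such spelling variants into a single key is sketched below. It is an assumption-laden illustration, not the procedure used in the study: the helper names `normalize` and `group_variants` are hypothetical, and the choice to strip only grave and acute accents is made here so that Norwegian å (which also decomposes into a base letter plus a combining mark) is left intact.

```python
import unicodedata
from collections import defaultdict

# Assumption: only combining grave and acute accents are treated as noise.
STRIPPED_MARKS = {"\u0300", "\u0301"}


def normalize(word):
    """Lowercase the word and drop grave/acute accents, keeping å, ø and æ."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    kept = "".join(ch for ch in decomposed if ch not in STRIPPED_MARKS)
    return unicodedata.normalize("NFC", kept)


def group_variants(words):
    """Map each normalized key to the set of spellings that share it."""
    groups = defaultdict(set)
    for word in words:
        groups[normalize(word)].add(word)
    return dict(groups)


variants = ["pokemon", "Pokemon", "pokèmon", "pokémon", "Pokèmon", "Liv", "liv"]
groups = group_variants(variants)
print(groups)
```

Lowercasing also merges Liv (a name) with liv (life), so a grouping like this trades away exactly the case information that name recognition would need.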

8
Most and Least Frequent Search Words
  • The most frequent content words: sex and chat
    (52 000 and 36 000 searches)
  • The least frequent of the 5500 words: hyttetomter
    (cabin plots) and a.s (275 and 210 searches)

9
An Example: Per var høy. (Per was tall.)
  • "<Per>"
  • "per" subst mask prop (= proper name)
  • "per" prep (= pr., per)
  • "<var>"
  • "var" adj pos m/f ub ent (= sensitive)
  • "var" subst nøyt appell ent ub (= a
    pillowcase)
  • "var" subst nøyt appell fl ub (=
    several pillowcases)
  • "vare" verb imp (= last, do not stop!)
  • "være" verb pret (= was, had the
    property)
  • "<høy>"
  • "høy" adj pos m/f ub ent (= tall, not low)
  • "høy" subst nøyt appell ent ub (= hay, dried grass)

10
How to Find Out About Ambiguity?
  • We checked each word against the Internet edition
    of Multitaggeren
  • It provides information on grammatical properties
    of inflected words (e.g. bilde (picture), but
    also bilder (pictures))
  • It does not provide any information on the word's
    meaning

11
Ambiguity as Homonymy, not Polysemy
  • Relevant: the putevar (pillowcase) and følsom
    (sensitive) meanings of var
  • Irrelevant: the lowest-part-of-a-leg and
    the-lowest-part-of-a-mountain meanings of fot
    (foot)

12
The Work Process
  • For each search word we:
  • Checked the word for ambiguity
  • Investigated what kind of ambiguity was involved
  • Investigated how grammatical categories and
    properties correlated with the different meanings
  • Searched the Internet using Fast's search
    engine Alltheweb.com and registered the
    distribution of the different meanings
  • Filled in a form

13
The Form

14
An Example of a Completed Form

15
Filling In the Form: The Header
  • number: the absolute number of searches for that
    particular word which are registered in the log
    files (e.g. 450 searches for the word rose
    (rose))
  • norsk: Norwegian was the chosen language for
    search pages

16
Filling In the Form: Ambiguity
  • The search word was checked against
    Multitaggeren. If there was more than one entry,
    ja (yes) was written in the Ambiguity field
    of the form
  • E.g. the word lærer has two entries:
  • - subst appell mask ent ub (noun: teacher)
  • - verb pres (verb: teaches, learns)
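
The "more than one entry" test can be sketched as a lookup in a full-form table. The table below is a hypothetical stand-in, not Multitaggeren's real interface or data: the entries for lærer and høy are copied from the slides, while the tag string for matematikk is invented for illustration.

```python
# Hypothetical full-form table: word form -> list of grammatical readings.
FULL_FORMS = {
    "lærer": ["subst appell mask ent ub", "verb pres"],
    "høy": ["adj pos m/f ub ent", "subst nøyt appell ent ub"],
    "matematikk": ["subst appell mask ent ub"],  # illustrative tag only
}


def is_ambiguous(word):
    """A word counts as ambiguous when the table lists more than one reading."""
    return len(FULL_FORMS.get(word.lower(), [])) > 1


for word in ["lærer", "matematikk"]:
    print(word, "ja" if is_ambiguous(word) else "ok")
```

A word missing from the table is treated as unambiguous here, which is a simplification; the study marked foreign words, numbers and similar items as irrelevant instead.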

17
Filling In the Form: Ambiguity
  • For each ambiguous word, its grammatical
    categories were written down together with the
    meanings associated with each grammatical
    category
  • The percentage of the search results for each
    specific meaning was also registered.  

18
Filling In the Form: Important for Search?
  • Not all types of ambiguity were considered
    important
  • For instance, the word lærer is ambiguous because
    it means both a person and an action, but the
    meanings are closely related
  • In such cases the word nei (no) was written in
    this field
  • Otherwise, the word ja (yes) was filled in

19
Filling In the Form: Inflected Form
  • If the search word was an inflected form, we
    filled this field with information on the word's
    grammatical properties
  • bilder (pictures): ub fl (indefinite plural)
  • høyskolen (the college): be ent (definite
    singular)
  • jenter (girls): ub fl
  • sitater (quotes): ub fl
  • venner (friends): ub fl

20
Filling In the Form: Name

21
Filling In the Form: Note
  • Note: ok
  • When the word was not found to be ambiguous the
    comment ok was written in the Note-field
  • Examples:
  • ringetone (calling signal)
  • matematikk (mathematics)
  • bunad (national costume)
  • epostleser (e-mail reader)
  • mobiltelefon (cellular phone)

22
Filling In the Form: Note
  • Note: irr (irrelevant)
  • words in a foreign language, e.g. you, car,
    history, wars, tjejer
  • numbers, e.g. 1981, 10
  • single letters (except e for ecstasy)
  • words which contain signs which are not letters
    or numbers, e.g. oslo, .no

23
Filling In the Form: Note
  • Note: poly (polysemy)
  • Polysemic ambiguity was not accounted for in this
    investigation
  • Nevertheless, the polysemic meanings of a search
    word were listed in this field

24
Some of the Results in Numbers

25
Concrete suggestions which involve grammatical
tagging
  • Grammatical tagging of words
  • Name recognition

26
Concrete suggestions which involve grammatical
tagging: Grammatical Tagging of Words
  • Tagging would solve some ambiguities:
  • Kort (card) noun (56 hits) vs.
  • Kort (short) adjective (32 hits)
  • But some are beyond the reach of a word tagger:
  • Kort (playing card) noun vs.
  • Kort (credit card) noun

27
Grammatical Tagging of Words: Grammatical Tagging
Combined with Dialogue Boxes
  • I. Check each search criterion against a
    full-form dictionary
  • hopper (noun, singular) (ski jumper)
  • hopper (noun, plural) (mares)

28
Grammatical Tagging of Words: Grammatical Tagging
Combined with Dialogue Boxes 1
  • Provide one of the following:
  • A dialogue box for grammatical categories:
    Would you like to search for a singular noun
    (S), or a plural noun (P)?
  • Drawback: Many users are not familiar with
    grammatical terminology

29
Grammatical Tagging of Words: Grammatical Tagging
Combined with Dialogue Boxes 2
  • A dialogue box for word meanings:
    Do you want to find out about one who takes part
    in ski-jumping (1) or about the female of any
    equine animal (as the horse, ass, or zebra)
    (2)?
  • Drawback: This requires a link to a dictionary
    of definitions

30
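Grammatical Tagging of Words: Grammatical Tagging
Combined with Dialogue Boxes 3
  • A dialogue box based on sorting of hits:
    Do you want hits of type 1 or type 2?
  • 1. Russisk hopper utestengt to år, Oslo
    Den russiske hopperen Dimitri Vassiljev er
    utestengt for to år etter å ha testet positiv
    på en dopingtest
    www.api.no/0/04/78/1.html
    (Russian jumper banned for two years, Oslo. The
    Russian jumper Dimitri Vassiljev has been banned
    for two years after testing positive in a doping
    test.)
  • 2. Register for hele Norge og Sverige med
    opplysninger om drektige hopper i samtlige
    hest- og ponniraser
    w1.2671.telia.com/u267102940/folljour.html
    (Register for all of Norway and Sweden with
    information on pregnant mares of all horse and
    pony breeds.)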

31
Grammatical Tagging of Words: Grammatical Tagging
Combined with Sorting of Hits
  • Avoid interaction with the user; sort the hits
    after tagging
  • AV (adjective, audiovisual): 5% of the hits
  • av (preposition, of): 95% of the hits
  • Show content words before function words!
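
A minimal sketch of "content words before function words" follows, assuming each hit already carries the part of speech that a tagger assigned to the query word in that page. The `Hit` class, the set of function-word tags and the example URLs are all hypothetical.

```python
from dataclasses import dataclass

# Assumed tag names for function words; a real tagger's tag set may differ.
FUNCTION_WORD_TAGS = {"prep", "konj", "pron", "det"}


@dataclass
class Hit:
    url: str
    tag: str  # part of speech of the query word in this hit (assumed given)


def sort_hits(hits):
    """Stable sort: content-word readings first, original order otherwise kept."""
    return sorted(hits, key=lambda hit: hit.tag in FUNCTION_WORD_TAGS)


hits = [
    Hit("https://example.com/1", "prep"),  # av = "of"
    Hit("https://example.com/2", "adj"),   # AV = "audiovisual"
    Hit("https://example.com/3", "prep"),
]
ranked = sort_hits(hits)
print([hit.url for hit in ranked])
```

Because the sort is stable, the search engine's own relevance order is preserved inside each of the two groups.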

32
Grammatical Tagging of Words: Grammatical Tagging
Combined with Sorting of Hits
  • Drawbacks:
  • Unpredictable interest:
    kort (short/card) adjective or noun?
  • Several grammatical forms may be equally
    interesting:
    hopper (jumper) noun singular
    hopper (jumps) verb present

33
Grammatical Tagging of Words: Grammatical Tagging
Used to Find All Inflected Forms
  • Internet users tend to choose only one inflected
    form of a word as a search criterion, e.g. rose
    (rose) or roser (roses)
  • Consequence: They miss relevant hits

34
Grammatical Tagging of Words: Grammatical Tagging
Used to Find All Inflected Forms
  • Solution:
  • Find all forms of a relevant search word!
  • Use a grammatical tagger rather than just partial
    matching, to get suppletive forms, e.g. gås
    (goose), gåsa (the goose), gjess (geese), and
    gjessene (the geese).
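
Expanding a query to all inflected forms could look roughly like the sketch below. The form-to-lemma table is a hypothetical fragment containing only the forms mentioned in these slides; a real full-form dictionary would be far larger and is not modelled here.

```python
# Hypothetical fragment of a full-form dictionary: inflected form -> lemma.
FORM_TO_LEMMA = {
    "gås": "gås",
    "gåsa": "gås",
    "gjess": "gås",
    "gjessene": "gås",
    "rose": "rose",
    "roser": "rose",
}


def expand_query(word):
    """Return every known form sharing the word's lemma, or the word itself."""
    lemma = FORM_TO_LEMMA.get(word)
    if lemma is None:
        return [word]
    return sorted(
        form for form, lemma_of_form in FORM_TO_LEMMA.items()
        if lemma_of_form == lemma
    )


print(expand_query("gjess"))
print(expand_query("roser"))
```

Prefix matching on gjess would never reach gås or gåsa, which is why the lookup goes through the lemma rather than through shared characters.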

35
Concrete suggestions which involve grammatical
tagging: Name Recognition
  • Search engines do not usually distinguish
    between upper and lower case letters.
  • Consequence: They fail to distinguish between
    proper names and appellatives, for instance:
  • Dag (male name): 27% of the hits
  • dag (day): 65% of the hits
  • Solution: distinguish upper and lower case!
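
The difference between case-insensitive and case-sensitive matching can be shown with a toy example. The `matches` helper and the token list are assumptions made for illustration; they do not reproduce the percentages above.

```python
def matches(query, token, case_sensitive=True):
    """Compare a query with a token, optionally ignoring letter case."""
    if case_sensitive:
        return query == token
    return query.lower() == token.lower()


# Hypothetical tokens taken from indexed pages.
tokens = ["Dag", "dag", "dag", "Dag", "dagen"]

name_hits = [t for t in tokens if matches("Dag", t)]
all_hits = [t for t in tokens if matches("Dag", t, case_sensitive=False)]
print(len(name_hits), len(all_hits))
```

Case alone is an imperfect signal, since an appellative is also capitalised at the start of a sentence, so a real system would need sentence position or a tagger as well.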

36
Concrete suggestions which involve grammatical
tagging: Name Recognition
  • Drawbacks:
  • Some proper names are ambiguous:
  • Java
  • programming language: 99% of the hits
  • the island: 1% of the hits

37
Conclusion
  • Every fourth word used on the Fast search engine
    is ambiguous
  • Almost 15% of all search words were ambiguous
    between meanings that do not share the same
    semantic origin.

38
Conclusion
  • It would be of great help to have search engines
    which enable users to choose the preferred
    meaning of ambiguous words. Using a grammatical
    tagger would be an important step in this
    direction.