Title: On Ambiguity in Internet Searches
1 On Ambiguity in Internet Searches
- A collaboration between the University of Oslo and Fast Search and Transfer
- Conducted by Tekstlaboratoriet, University of Oslo
- July and August 2000
2 The People Involved
- Gordana Ilic Holen
- Janne von Koss Torkildsen
- Janne Bondi Johannessen
- g.i.holen, janneto, jannebj_at_mail.hf.uio.no
3 Prospect
- To what extent does search on the Internet give irrelevant information?
- To what extent is this due to semantically ambiguous search words?
- Do semantically ambiguous search words coincide with differences in grammatical category?
- If so, could a disambiguating tagger solve ambiguity problems?
4 Method
- Studied log files of search words from Fast's search engine
- Decided whether each word was ambiguous or not
- If a search word was ambiguous, its meanings were sorted according to several different criteria
- Reconstructed the searches
5 Results
- Almost one fourth (23.16%) of the most frequent search words were ambiguous
- In 90% of these cases there was a correlation between meaning and grammatical category
6 Material: The Log Files
- 900 000 search words from Fast Search and Transfer's log files, coming from several different search engines
- The words were sorted by frequency, and identical word forms were put together
- The 5500 most frequent words were picked for further investigation
- This way we made sure to work only on common search words
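The frequency step can be sketched in Python. This is only an illustration, not the project's actual tooling: the log format (one search word per line) and the names `top_search_words` and `sample_log` are assumptions made for the example.

```python
from collections import Counter

def top_search_words(log_lines, n):
    """Count identical word forms and return the n most frequent ones."""
    # Identical word forms are put together; case is kept, as in the log files.
    counts = Counter(line.strip() for line in log_lines if line.strip())
    return counts.most_common(n)

# Hypothetical miniature log: one search word per line.
sample_log = ["sex", "chat", "sex", "Liv", "liv", "chat", "sex"]
top = top_search_words(sample_log, 2)
```

With the real material, `n` would be 5500 and the input would be the 900 000 logged search words.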
7 Material: Number of Lexemes
- The number of lexemes is much smaller than the number of search words because:
- The files distinguish between capital and non-capital letters (e.g. Liv and liv)
- In several cases many different spellings are used for the same word (e.g. pokemon, Pokemon, pokèmon, pokémon, Pokèmon)
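One way to collapse such spelling variants into a single key is sketched below. The function name `lexeme_key` and the choice to strip only grave and acute accents are assumptions for illustration; the slides do not describe how the variants were actually merged. Stripping only those two accents leaves the Norwegian letter å intact.

```python
import unicodedata

# Only grave and acute accents are stripped, so Norwegian å (combining ring) survives.
STRIPPED_ACCENTS = {"\u0300", "\u0301"}

def lexeme_key(word):
    """Map spelling variants of a search word to one comparison key."""
    decomposed = unicodedata.normalize("NFD", word.casefold())
    kept = "".join(ch for ch in decomposed if ch not in STRIPPED_ACCENTS)
    return unicodedata.normalize("NFC", kept)

variants = ["pokemon", "Pokemon", "pokèmon", "pokémon", "Pokèmon"]
keys = {lexeme_key(v) for v in variants}
```

Note that folding case also merges Liv and liv, which is useful for counting spellings but discards the name/noun distinction discussed under name recognition.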
8 Most and Least Frequent Search Words
- The most frequent content words: sex and chat (52 000 and 36 000 searches)
- The least frequent of the 5500 words: hyttetomter (cabin lots) and a.s (275 and 210 searches)
9 An Example: Per var høy. (Per was tall.)
- "<Per>"
- "per" subst mask prop (egennavn: proper name)
- "per" prep (pr.: per)
- "<var>"
- "var" adj pos m/f ub ent (følsom: sensitive)
- "var" subst nøyt appell ent ub (et putevar: a pillowcase)
- "var" subst nøyt appell fl ub (flere putevar: several pillowcases)
- "vare" verb imp (ikke slutt!: don't stop!)
- "være" verb pret (hadde egenskapen: had the property)
- "<høy>"
- "høy" adj pos m/f ub ent (ikke lav: not low)
- "høy" subst nøyt appell ent ub (gress: grass)
10 How to Find Out About Ambiguity?
- We checked each word against the Internet edition of Multitaggeren
- It provides information on the grammatical properties of inflected words (e.g. bilde (picture), but also bilder (pictures))
- It does not provide any information on the word's meaning
11 Ambiguity as Homonymy, not Polysemy
- Relevant: the putevar (pillowcase) and følsom (sensitive) meanings of var
- Irrelevant: the-lowest-part-of-a-leg and the-lowest-part-of-a-mountain meanings of fot (foot)
12 The Work Process
- For each search word we:
- Checked the word for ambiguity
- Investigated what kind of ambiguity was involved
- Investigated how grammatical categories and properties correlated with the different meanings
- Searched on the Internet using Fast's search engine Alltheweb.com and registered the distribution of the different meanings
- Filled in a form
13 The Form
14 An Example of a Completed Form
15 Filling In the Form: The Header
- number: the absolute number of searches for that particular word registered in the log files (e.g. 450 searches for the word rose (rose))
- norsk: Norwegian was the chosen language for search pages
16 Filling In the Form: Ambiguity
- The search word was checked against Multitaggeren. If there was more than one entry, ja (yes) was written in the Ambiguity field of the form
- E.g. the word lærer would reveal two readings:
- subst appell mask ent ub (teacher)
- verb pres (teaches)
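The "more than one entry" rule can be sketched as follows. The lookup table `READINGS` is a hypothetical stand-in for Multitaggeren, not its real interface; the tag strings for lærer and høy are copied from the slides, while the entry for matematikk is an assumption added to show an unambiguous word.

```python
# Hypothetical stand-in for Multitaggeren output: word form -> list of readings.
READINGS = {
    "lærer": ["subst appell mask ent ub", "verb pres"],
    "høy": ["adj pos m/f ub ent", "subst nøyt appell ent ub"],
    "matematikk": ["subst appell mask ent ub"],  # assumed entry, for illustration
}

def is_ambiguous(word, readings=READINGS):
    """A word counts as ambiguous when the tagger lists more than one entry."""
    return len(readings.get(word, [])) > 1

def categories(word, readings=READINGS):
    """Return the main grammatical categories (first tag of each reading)."""
    return {reading.split()[0] for reading in readings.get(word, [])}
```

Comparing `categories(word)` with the meanings found for a word is one way to check whether meaning and grammatical category coincide.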
17 Filling In the Form: Ambiguity
- For each ambiguous word, its grammatical categories were written down together with the meanings associated with each grammatical category
- The percentage of the search results for each specific meaning was also registered.
18 Filling In the Form: Important for Search?
- Not all types of ambiguity were considered important
- For instance, the word lærer is ambiguous because it means both a person and an action, but the meanings are closely related
- In such cases the word nei (no) was written in this field
- Otherwise, the word ja (yes) was filled in
19 Filling In the Form: Inflected Form
- If the search word was an inflected form, we filled this field with information on the word's grammatical properties
- bilder (pictures) ub fl
- høyskolen (the college) be ent
- jenter (girls) ub fl
- sitater (quotes) ub fl
- venner (friends) ub fl
20 Filling In the Form: Name
21 Filling In the Form: Note
- Note: ok
- When the word was not found to be ambiguous, the comment ok was written in the Note field
- Examples:
- ringetone (calling signal)
- matematikk (mathematics)
- bunad (national costume)
- epostleser (e-mail reader)
- mobiltelefon (cellular phone)
22 Filling In the Form: Note: irr (irrelevant)
- words in a foreign language, e.g. you, car, history, wars, tjejer
- numbers, e.g. 1981, 10
- single letters (except e for ecstasy)
- words which contain signs which are not letters or numbers, e.g. oslo, .no
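The irr criteria can be expressed as a small rule function. This is a sketch under assumptions: the slides do not say how foreign words were identified, so the word list `FOREIGN` is hypothetical, as are the function name and the returned labels.

```python
def irrelevance_reason(word, foreign_words=frozenset()):
    """Return why a search word would be marked irr, or None if no rule applies."""
    if word in foreign_words:
        return "foreign"
    if word.isdigit():
        return "number"
    # The single letter e is kept because it is searched for as "ecstasy".
    if len(word) == 1 and word.isalpha() and word != "e":
        return "single letter"
    if not word.isalnum():
        return "non-alphanumeric sign"
    return None

# Hypothetical word list built from the examples on the slide.
FOREIGN = frozenset({"you", "car", "history", "wars", "tjejer"})
```

For example, `irrelevance_reason(".no")` reports a non-alphanumeric sign, while `irrelevance_reason("bunad")` returns `None`.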
23 Filling In the Form: Note: poly (polysemy)
- Polysemic ambiguity was not accounted for in this investigation
- Nevertheless, the polysemic meanings of a search word were listed in this field
24 Some of the Results in Numbers
25 Concrete suggestions which involve grammatical tagging
- Grammatical tagging of words
- Name recognition
26 Concrete suggestions which involve grammatical tagging: Grammatical Tagging of Words
- Tagging would solve some ambiguities
- Kort (card) noun (56 hits) vs.
- Kort (short) adjective (32 hits)
- But some are beyond reach of a word tagger
- Kort (playing card) noun vs.
- Kort (credit card) noun
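A toy illustration of what tagging can and cannot separate: hits are grouped by the grammatical category of the matched word. The hit tuples and the sense labels are invented for the example; a real tagger would supply only the category, not the sense.

```python
from collections import defaultdict

def group_by_category(tagged_hits):
    """Group hits by the grammatical category assigned to the search word."""
    groups = defaultdict(list)
    for hit_id, category, sense in tagged_hits:
        groups[category].append((hit_id, sense))
    return dict(groups)

# Hypothetical hits for the query "kort": (hit id, tagger category, intended sense).
hits = [
    (1, "noun", "playing card"),
    (2, "adjective", "short"),
    (3, "noun", "credit card"),
]
groups = group_by_category(hits)
noun_senses = {sense for _, sense in groups["noun"]}
```

The adjective reading ends up in its own group, but both noun senses share one group, which is the case a word tagger cannot resolve.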
27 Grammatical Tagging of Words: Grammatical Tagging Combined with Dialog Boxes
- I. Check each search criterion against a full-form dictionary
- hopper (noun, singular) (ski jumper)
- hopper (noun, plural) (mares)
28 Grammatical Tagging of Words: Grammatical Tagging Combined with Dialog Boxes 1
- Provide one of the following:
- A dialogue box for grammatical categories
- "Would you like to search for a singular noun (S), or a plural noun (P)?"
- Drawback: many users are not familiar with grammatical terminology
29 Grammatical Tagging of Words: Grammatical Tagging Combined with Dialog Boxes 2
- A dialogue box for word meanings
- "Do you want to find out about one who takes part in ski-jumping (1) or about the female of any equine animal (as the horse, ass, or zebra) (2)?"
- Drawback: this requires a link to a dictionary of definitions
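30 Grammatical Tagging of Words: Grammatical Tagging Combined with Dialog Boxes 3
- A dialogue box based on sorting of hits
- "Do you want hits of type 1 or type 2?"
- 1. Russisk hopper utestengt to år, Oslo (Russian jumper banned for two years, Oslo)
- Den russiske hopperen Dimitri Vassiljev er utestengt for to år etter å ha testet positiv på en dopingtest (The Russian jumper Dimitri Vassiljev is banned for two years after testing positive in a doping test)
- www.api.no/0/04/78/1.html
- 2. Register for hele Norge og Sverige med opplysninger om drektige hopper i samtlige hest- og ponniraser (Register for all of Norway and Sweden with information on pregnant mares of all horse and pony breeds)
- w1.2671.telia.com/u267102940/folljour.html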
31 Grammatical Tagging of Words: Grammatical Tagging Combined with Sorting of Hits
- Avoid interactions with the user; sort the hits after tagging
- AV (adjective, audiovisual): 5% of the hits
- av (preposition, of): 95% of the hits
- Show content words before function words!
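Sorting tagged hits so that content-word readings come first could look like the sketch below. The set `FUNCTION_CATEGORIES` and the hit tuples are assumptions for illustration; the slides do not list which categories count as function words.

```python
# Assumed set of function-word categories; the slides name no such list.
FUNCTION_CATEGORIES = {"preposition", "conjunction", "pronoun", "determiner"}

def sort_hits(tagged_hits):
    """Order hits so content-word readings come before function-word readings."""
    # sorted() is stable, so the engine's original ranking is kept within each group.
    return sorted(tagged_hits, key=lambda hit: hit[1] in FUNCTION_CATEGORIES)

# Hypothetical hits for the query "av": (hit id, category of the matched word).
hits = [(1, "preposition"), (2, "adjective"), (3, "preposition"), (4, "adjective")]
ranked = sort_hits(hits)
```

Because the sort is stable, the rare audiovisual hits move to the top without reshuffling the engine's ranking inside either group.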
32 Grammatical Tagging of Words: Grammatical Tagging Combined with Sorting of Hits
- Drawbacks:
- Unpredictable interest
- kort (short/card): adjective or noun?
- Several grammatical forms may be equally interesting
- hopper (jumper) noun singular
- hopper (jumps) verb present
33 Grammatical Tagging of Words: Grammatical Tagging Used to Find all Inflected Forms
- Internet users tend to choose only one inflected form of a word as a search criterion, e.g. rose (rose) or roser (roses)
- Consequence: they miss relevant hits
34 Grammatical Tagging of Words: Grammatical Tagging Used to Find all Inflected Forms
- Solution:
- Find all forms of a relevant search word!
- Use a grammatical tagger rather than just partial
matching, to get suppletive forms, e.g. gås
(goose), gåsa (the goose), gjess (geese), and
gjessene (the geese).
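Expanding a query to all forms of its lexeme can be sketched with a full-form dictionary. The table `FULL_FORMS` is hypothetical: the gås paradigm is taken from the slide, while the rose forms other than rose and roser are filled in as an assumption.

```python
# Hypothetical full-form dictionary: lemma -> all inflected forms.
FULL_FORMS = {
    "gås": ["gås", "gåsa", "gjess", "gjessene"],
    "rose": ["rose", "rosen", "roser", "rosene"],  # partly assumed forms
}

# Reverse index: any inflected form -> its lemma.
FORM_TO_LEMMA = {form: lemma for lemma, forms in FULL_FORMS.items() for form in forms}

def expand_query(word):
    """Return every form of the word's lexeme, or the word alone if it is unknown."""
    lemma = FORM_TO_LEMMA.get(word)
    return FULL_FORMS[lemma] if lemma else [word]
```

A search for gjess would then also match gås and gåsa, which prefix matching on the typed form cannot achieve.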
35 Concrete suggestions which involve grammatical tagging: Name Recognition
- Search engines do not usually distinguish between upper and lower case letters.
- Consequence: they fail to distinguish between proper names and appellatives, for instance:
- Dag (male name): 27% of the hits
- dag (day): 65% of the hits
- Solution: distinguish upper and lower case!
36 Concrete suggestions which involve grammatical tagging: Name Recognition
- Drawbacks:
- Some proper names are ambiguous
- Java
- programming language: 99% of the hits
- the island: 1% of the hits
37 Conclusion
- Roughly every fourth of the most frequent search words on the Fast search engine is ambiguous
- Almost 15% of all search words were ambiguous between words that do not have the same semantic origin.
38 Conclusion
- It would be of great help to have search engines
which enable users to choose the preferred
meaning of ambiguous words. Using a grammatical
tagger would be an important step in this
direction.