Title: Destination Japan: Internationalization of the Lycos Search Engine
1 Destination Japan: Internationalization of the Lycos Search Engine
- Presented by
- Jeff Vander Clute of Lycos, Inc.
- Tina Lieu of Basis Technology Corp.
2 Lycos is...
- A new generation Web company
- - 4 top 20 Web properties in Network
- - Lycos, Tripod, Angelfire, HotBot
- A hub
- Search Engine Navigation
- - Patented search directory technology
- Community Communication
- E-commerce, Content Aggregation, Etc.
3 The Search Technology
- Created by a CMU professor (Fuzzy Mauldin) and students in 1994/95.
- 1. Intelligent spidering methods (now patented), but not internationalized. Spiders crawl the Web, retrieving documents for indexing.
- 2. Back-end database of webpages, or catalog, plus relevancy algorithms for ordering search results.
7 First Stop: Europe
- Lycos search technology initially for ASCII only. In-house work to make data paths 8-bit clean, to accommodate European languages.
- Otherwise relatively straightforward. Components such as ad servers, Web servers, etc., require little if any change.
- Euro service came online in May 1997.
8 What's Unicode? Where's Japan?
- The more interesting problem.
- Business reasons to introduce Japanese search.
- But not a lot of international(ization) experience within Lycos at the time.
- We needed assistance and chose Basis Technology.
9 Goals
- Quick deployment of Japanese search
- - 1995 to 1997, Japanese Internet more than doubling each year
- - Marketing need to launch in Japan ASAP
- Economical and efficient solution
- - Produce reusable internationalized code
- - Poise Lycos for even quicker deployment into other languages
- - Get "more bang for the buck"
10 Two Main Functions of a Search Engine
- Building a catalog: Compiling an indexed catalog of webpages from the Internet
- Performing a query: Delivering a list of webpages matching certain keywords and parameters input by the user
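The two functions above can be illustrated with a toy inverted index. This is a minimal sketch and not the Lycos implementation: the function names `build_catalog` and `run_query`, the sample pages, and the whitespace tokenizer are all assumptions made for illustration.

```python
from collections import defaultdict


def build_catalog(pages):
    """Build an inverted index: each word maps to the set of URLs containing it."""
    catalog = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            catalog[word].add(url)
    return catalog


def run_query(catalog, query):
    """Return the sorted URLs that contain every keyword in the query."""
    keywords = query.lower().split()
    if not keywords:
        return []
    matches = set(catalog.get(keywords[0], set()))
    for word in keywords[1:]:
        matches &= catalog.get(word, set())
    return sorted(matches)


# Hypothetical sample pages, for illustration only.
pages = {
    "http://example.com/a": "Tokyo travel guide",
    "http://example.com/b": "Tokyo restaurant guide",
    "http://example.com/c": "Kyoto travel tips",
}
catalog = build_catalog(pages)
results = run_query(catalog, "tokyo guide")
print(results)
```

Splitting on whitespace works for this English sample only; Japanese text needs a word breaker before it can be indexed this way.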
15 Japanese Issues for Catalog
- Double-byte: Japanese characters are double-byte.
- Multiple encodings: Japanese webpages use 3 encodings: Shift-JIS, EUC-JP, and ISO-2022-JP.
- Options: Multiple vs. Single Catalog
- - Three catalogs: one in Shift-JIS, one in EUC-JP, one in ISO-2022-JP (an awkward and complicated solution to implement), OR
- - One catalog: all catalog data either in one Japanese encoding or in Unicode
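To see why three encodings complicate a catalog, the sketch below encodes the same short string in each of them using Python's built-in codecs. The sample string is an arbitrary choice for illustration; the codec names `shift_jis`, `euc_jp`, and `iso2022_jp` are Python's names for these encodings.

```python
text = "日本語"  # sample text, chosen only for illustration

# Encode the same characters in the three encodings named on the slide.
encoded = {name: text.encode(name) for name in ("shift_jis", "euc_jp", "iso2022_jp")}

for name, raw in encoded.items():
    print(f"{name:>10}: {raw.hex(' ')}")

# The same text yields three different byte sequences, so a byte-level
# index built from one encoding cannot match pages stored in another.
distinct = len(set(encoded.values()))
```

Shift-JIS and EUC-JP both use two bytes per kanji but with different byte values, while ISO-2022-JP stays within 7 bits and switches character sets with escape sequences.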
16 Single Catalog Options
- A) Convert all data to one Japanese encoding
- - ISO-2022-JP, Shift-JIS, or EUC-JP
- B) Convert all data to Unicode
- The quick and economical choice: Unicode is . . .
- - A superset of all scripts and character set encodings used on the Web, therefore reusable for other languages
- - More easily implemented into existing code originally written for processing single-byte ASCII
17 The Unicode Plan
- Use Unicode in catalog internal processing
- - Because all electronic text on the Web maps cleanly into Unicode
- Required elements
- - Character encoding conversions between Unicode and webpage encodings (webpage encodings: Shift-JIS, ISO-2022-JP, and EUC-JP)
- - Encoding auto-detection
- - Japanese word breaking
18 Encoding Conversion
- Purpose: Convert data between encodings used on the Web and Unicode (which is still not used universally on the Web)
- From ?? in Shift-JIS you want ?? in Unicode
- Functionality provided by Basis Technology's Rosette, embedded in Lycos code as source
- - Rosette is a cross-platform C library for Unicode: http://www.basistech.com/products/
- - Complete set of mapping tables between Unicode and major legacy encodings
- - Conversions performed quickly and economically with minimal impact on performance
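The conversion step can be sketched with Python's built-in codecs standing in for Rosette, whose API is not shown in these slides. The helper names `to_unicode`, `from_unicode`, and `transcode` are hypothetical, and the sample word is arbitrary.

```python
def to_unicode(raw: bytes, source_encoding: str) -> str:
    """Decode bytes in a legacy Web encoding into Unicode text."""
    return raw.decode(source_encoding)


def from_unicode(text: str, target_encoding: str) -> bytes:
    """Encode Unicode text into the requested encoding."""
    return text.encode(target_encoding)


def transcode(raw: bytes, source_encoding: str, target_encoding: str) -> bytes:
    """Convert between two encodings, using Unicode as the pivot."""
    return from_unicode(to_unicode(raw, source_encoding), target_encoding)


# A Shift-JIS byte string, as it might arrive from a webpage.
sjis_page = "検索".encode("shift_jis")

# Convert to UTF8 for storage, then back to Shift-JIS for display.
utf8_page = transcode(sjis_page, "shift_jis", "utf-8")
round_trip = transcode(utf8_page, "utf-8", "shift_jis")
```

Using Unicode as the pivot means each legacy encoding needs only one pair of mapping tables, rather than one pair for every combination of encodings.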
19 Why Encoding Auto-Detection?
- In order to convert text to another encoding, you have to know where you're starting from. Or you could get . . .
- Ex. Text in EUC-JP when viewed as other encodings:
- - EUC-JP: ?? ?????? ??
- - Shift-JIS: ??? ?????????? ????
- - ASCII:
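The effect described above can be reproduced with Python's built-in codecs. This is an illustrative sketch with an arbitrary sample string: EUC-JP bytes are decoded once correctly and twice under the wrong assumption, with undecodable bytes replaced by U+FFFD.

```python
text = "日本語のテキスト"  # sample text, chosen only for illustration
raw = text.encode("euc_jp")

# Correct assumption: the original text comes back.
as_euc_jp = raw.decode("euc_jp")

# Wrong assumptions: invalid bytes become U+FFFD replacement characters.
as_shift_jis = raw.decode("shift_jis", errors="replace")
as_ascii = raw.decode("ascii", errors="replace")

print("EUC-JP   :", as_euc_jp)
print("Shift-JIS:", as_shift_jis)
print("ASCII    :", as_ascii)
```

Every EUC-JP byte for these characters has its high bit set, so the ASCII view contains nothing but replacement characters, one per byte.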
20 Encoding Auto-Detection
- Purpose: to correctly identify the encoding of a webpage or query in order to convert properly from one encoding to another.
- Functionality provided by Basis Technology's Rosette
- - Auto-detection on Japanese text in Shift-JIS, EUC-JP, or ISO-2022-JP encodings
- - Enhanced tiebreaker functionality to auto-detect very short strings (queries)
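One simple way to approximate auto-detection is trial decoding with a tiebreaker. The sketch below is a heuristic written for illustration, not Rosette's algorithm, which these slides do not describe. The function name `detect_japanese_encoding` and the tiebreaker rule (prefer the reading with fewer half-width katakana, assumed here to be rare in real text) are assumptions.

```python
def detect_japanese_encoding(raw: bytes) -> str:
    """Guess the encoding of Japanese text. A heuristic sketch only."""
    if all(b < 0x80 for b in raw):
        # 7-bit data: ISO-2022-JP switches character sets with ESC sequences.
        return "iso2022_jp" if b"\x1b" in raw else "ascii"

    candidates = []
    for encoding in ("euc_jp", "shift_jis"):
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # the bytes are not valid in this encoding
        # Tiebreaker: count half-width katakana (U+FF61..U+FF9F), which a
        # wrong Shift-JIS reading of EUC-JP bytes tends to produce.
        penalty = sum(1 for ch in text if "\uff61" <= ch <= "\uff9f")
        candidates.append((penalty, encoding))

    return min(candidates)[1] if candidates else "unknown"


sample = "日本語のテキスト"
for name in ("shift_jis", "euc_jp", "iso2022_jp"):
    print(name, "->", detect_japanese_encoding(sample.encode(name)))

# A one-character query whose EUC-JP bytes are also valid Shift-JIS.
short_query = "の".encode("euc_jp")
print(detect_japanese_encoding(short_query))
```

The one-character query shows why short strings need a tiebreaker: its two bytes decode without error in both EUC-JP and Shift-JIS, so validity alone cannot decide. The rule used here will misjudge genuine half-width katakana in Shift-JIS, which is one reason a production detector needs more evidence than this.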
21 Japanese Word Breaking
- Purpose: To return indexable units (words) for creating an index, or for breaking the query into words to look up in the index.
- Problem: Japanese words are not delimited by spaces
- Solution: Basis Technology's Japanese Morphological Analyzer (http://www.basistech.com/products/)
- - Dictionary-based Japanese word breaking
- - Elimination of stop words (ex. "a", "the", etc.)
- - Looks for longest word match
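The three features listed above can be sketched as a greedy longest-match segmenter. This is a toy illustration, not the Basis Technology analyzer: the dictionary, the stop-word list, and the function name `break_words` are hypothetical, and a real morphological analyzer does far more than greedy matching.

```python
# Hypothetical toy dictionary and stop-word list, for illustration only.
DICTIONARY = {"東京", "東京都", "京都", "住む"}
STOP_WORDS = {"に"}


def break_words(text, dictionary=DICTIONARY, stop_words=STOP_WORDS):
    """Split unspaced text into words by greedy longest dictionary match."""
    max_len = max(len(word) for word in dictionary)
    words = []
    i = 0
    while i < len(text):
        match = text[i]  # fallback: an unknown character stands alone
        # Try the longest candidate first, then shorter ones.
        for length in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                match = candidate
                break
        if match not in stop_words:
            words.append(match)
        i += len(match)
    return words


print(break_words("東京都に住む"))
```

Longest match matters here: the segmenter picks the three-character entry over the shorter entry that shares its first two characters, and the particle in the stop-word list is dropped from the output.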
22 Selecting Unicode Representation (1): UCS2 characteristics
- Depending on the task, either the UCS2 or UTF8 representation of Unicode was used in different parts of the Lycos search
- Characteristics of UCS2
- - Each coded character element is fixed width, 16 bits
- - Data paths must all accommodate 16 bits
- - Text in UCS2 is easy to manipulate and analyze (from a programming viewpoint)
23 Selecting Unicode Representation (2): UTF8 characteristics
- Characteristics of UTF8
- - Each coded character is composed of one to six octets (one octet = 8 bits)
- - Data paths need only be "8-bit clean"
- - None of the octets in a multi-byte character are null (i.e., have the value of zero)
- - Text in UTF8 is difficult to manipulate or analyze.
- "8-bit clean": computer code which treats all 8 bits of a byte as significant. True of any computer code that processes European languages properly, but not necessarily true of code that processes only ASCII, which uses only 7 bits per character.
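The contrast between the two representations can be checked directly. In this sketch Python's `utf-16-be` codec stands in for UCS2, since the two produce the same bytes for the sample characters used here; `strip_high_bit` is a hypothetical helper that simulates a data path that is not 8-bit clean. Note that the current UTF-8 definition (RFC 3629) limits sequences to four octets; the six-octet maximum on the slide reflects the earlier definition.

```python
samples = {"ASCII": "A", "Latin with diacritical": "é", "Japanese": "日"}

for label, ch in samples.items():
    ucs2 = ch.encode("utf-16-be")  # same bytes as UCS2 for these characters
    utf8 = ch.encode("utf-8")
    print(f"{label:>22}: UCS2 {ucs2.hex(' ')} | UTF8 {utf8.hex(' ')}")


def strip_high_bit(raw: bytes) -> bytes:
    """Simulate a data path that is NOT 8-bit clean by clearing bit 7."""
    return bytes(b & 0x7F for b in raw)


original = "é".encode("utf-8")
damaged = strip_high_bit(original)
print(original.hex(' '), "->", damaged.hex(' '))
```

The UCS2 form of an ASCII character contains a zero octet, which breaks code that treats zero as a string terminator, while the UTF8 forms never contain one inside a multi-byte character. A 7-bit path leaves ASCII intact but corrupts every multi-byte UTF8 sequence.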
24 UCS2, UTF8, ASCII, etc.
- [Diagram: an 8-bit clean data pipe. ASCII (7 bits), a Latin character with diacritical (8 bits), and a Japanese character (double-byte, in Shift-JIS, EUC-JP, etc.) are each shown as UTF8 and as UCS2. 16-bit UCS2 can't fit through the pipe.]
25 Unicode in the Lycos System
- UCS2: Japanese Morphological Analyzer from Basis Technology
- - Using UCS2 is the quick and economical way to process huge volumes of Japanese text.
- UTF8: Lycos Catalog
- - Economy of disk space: ASCII is smaller in UTF8. On the Web: ASCII 79%, double-byte Asian less than 5%, European encodings and others 16%
- - Ease of integration with existing code (a.k.a. transmissibility)
- Figures are based on the number of Web hosts on the Internet by country (total number of hosts for English-speaking domains as a percentage of the total number of hosts worldwide). Source: Survey by Network Wizards, http://www.nw.com
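The disk-space argument can be made concrete with a rough calculation. This sketch rests on loud assumptions that are not in the slides: it treats the host shares as shares of stored characters, rounds "less than 5" up to 5, and charges every European character the two-octet worst case even though most European text is plain ASCII letters.

```python
# Share of Web content by encoding family, in percent (from the slide;
# "less than 5" is rounded up to 5 as an assumption).
share = {"ascii": 79, "european": 16, "asian": 5}

# Assumed UTF8 cost in octets per character (worst case for European).
utf8_octets = {"ascii": 1, "european": 2, "asian": 3}

# UCS2 always costs two octets per character.
UCS2_OCTETS = 2

avg_utf8 = sum(share[k] * utf8_octets[k] for k in share) / 100
savings = 1 - avg_utf8 / UCS2_OCTETS

print(f"average UTF8 octets per character: {avg_utf8:.2f}")
print(f"average UCS2 octets per character: {UCS2_OCTETS:.2f}")
print(f"space saved by UTF8: {savings:.0%}")
```

Under these assumptions UTF8 averages about 1.26 octets per character against 2 for UCS2, a saving of roughly 37 percent. The saving reverses for a catalog that is mostly Japanese, where UTF8 needs three octets per character and UCS2 only two.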
30 Project Complete: Lycos Japan (1)
- Quick: Prototype of Japanese search is produced in two months.
- Lycos Japan: http://www.lycos.co.jp
- Beta version of Japanese search debuts July 1998; enters competitive Japanese search engine race in 4th place
- Upon formal launch, grabs 2nd place in October 1998 (according to Search Desk, http://www.searchdesk.com)
32 Project Complete: Lycos Japan (2)
- Economical: Today, Lycos has spider, catalog and query software, which may easily be set to make catalogs in different languages by swapping in and out localized pieces
- - Settings for target domains
- - Encoding detection and conversion calls
- - Language-specific word breaker (if needed)
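The idea of swapping localized pieces in and out can be sketched as a small configuration object. Everything here is hypothetical and written for illustration: the `LocaleConfig` class, its field names, the `index_terms` function, and the two sample configurations do not come from the Lycos code. The detectors are fixed stand-ins, and the Japanese word breaker simply splits into characters as a placeholder.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class LocaleConfig:
    """The localized pieces that differ from one catalog to another."""
    target_domains: List[str]
    detect_encoding: Callable[[bytes], str]
    word_breaker: Optional[Callable[[str], List[str]]] = None


def index_terms(raw: bytes, config: LocaleConfig) -> List[str]:
    """Shared pipeline: detect the encoding, convert to Unicode, break words."""
    text = raw.decode(config.detect_encoding(raw))
    # Languages delimited by spaces need no special word breaker.
    breaker = config.word_breaker or str.split
    return breaker(text)


# Stand-in detectors that always return one encoding, for illustration.
european = LocaleConfig([".de", ".fr"], lambda raw: "latin-1")
japanese = LocaleConfig([".jp"], lambda raw: "euc_jp", word_breaker=list)

german_terms = index_terms("Größe zeigen".encode("latin-1"), european)
japanese_terms = index_terms("日本".encode("euc_jp"), japanese)
print(german_terms, japanese_terms)
```

The shared pipeline in `index_terms` never changes; adding a language means supplying a new configuration rather than new spider, catalog, or query code.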
33 Q&A
- Questions?
- tina_at_basistech.com, www.basistech.com
- jvanderclute_at_lycos.com, www.lycos.com
- Thank you!