Title: Destination Japan: Internationalization of the Lycos Search Engine
1 Destination Japan: Internationalization of the Lycos Search Engine
- Presented by
- Eric Gardner of Lycos, Inc.
- Tina Lieu of Basis Technology Corp.
2 Lycos Search Technology
- Created at CMU by Professor Fuzzy Mauldin and students in 1994-95.
- Intelligent spidering methods (now patented). Spiders crawl the web retrieving documents for indexing.
- Back-end database of webpages (web catalog)
- Query engine with relevancy algorithms for ordering search results.
- Not internationalized
7 Two Main Functions of a Search Engine
- Building a catalog
  - Input: webpages
  - Output: inverted word index
- Performing a query
  - Input: keywords and other search parameters
  - Output: list of matching webpages
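The two functions above can be sketched in a few lines. This is only an illustration under simple assumptions, not the Lycos implementation: words are split on whitespace, a query matches pages containing every keyword, and there is no relevancy ranking. The names `build_catalog` and `run_query` and the example URLs are hypothetical.

```python
from collections import defaultdict


def build_catalog(pages):
    """Input: {url: page text}. Output: inverted index {word: set of urls}."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def run_query(index, keywords):
    """Input: keywords. Output: sorted list of pages containing all of them."""
    postings = [index.get(word.lower(), set()) for word in keywords]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Hypothetical two-page web, for illustration only.
pages = {
    "http://example.com/a": "tokyo travel guide",
    "http://example.com/b": "kyoto travel tips",
}
catalog = build_catalog(pages)
print(run_query(catalog, ["travel"]))
print(run_query(catalog, ["tokyo", "travel"]))
```

Whitespace splitting is exactly the step that fails for Japanese text, which is why word breaking becomes a separate component later in the talk.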
15 Scale of the Task
- At Lycos we need to design for
  - 100,000,000 users daily
  - 1,000,000,000 documents
  - 10,000,000,000,000 bytes of data
16 First Stop: Europe
- Lycos search technology was initially for ASCII only. In-house work made data paths 8-bit clean to accommodate European languages.
- Otherwise relatively straightforward: components such as ad servers, web servers, etc., required little if any change.
- European service came online in May 1997.
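To illustrate why data paths had to be made 8-bit clean, here is a minimal sketch that simulates a legacy path which discards the high bit of every byte. The function names are hypothetical and the ISO-8859-1 sample is chosen only for illustration; it is not taken from Lycos code.

```python
def seven_bit_path(data: bytes) -> bytes:
    """Simulate a legacy data path that discards the high bit of every byte."""
    return bytes(b & 0x7F for b in data)


def eight_bit_clean_path(data: bytes) -> bytes:
    """Simulate a path that passes every byte through unchanged."""
    return bytes(data)


original = "café".encode("latin-1")  # the accented letter is the byte 0xE9
print(seven_bit_path(original))      # the accented letter is corrupted
print(eight_bit_clean_path(original).decode("latin-1"))
```

Plain ASCII survives both paths, which is why code written only for ASCII can hide this problem until European text arrives.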
17 Next Stop: Japan
- Business and strategic reasons to introduce Japanese search.
- Not a lot of international(ization) experience within Lycos at the time.
- We needed assistance and chose Basis Technology.
18 Goals
- Quick deployment of Japanese search
  - 1995 to 1997: Japanese Internet more than doubling each year
  - Marketing need to launch in Japan ASAP
- Economical, extensible solution
  - Produce reusable internationalized code
  - Poise Lycos for even quicker deployment for other languages
  - Get "more bang for the buck"
19 Japanese Issues for Catalog
- Double-byte: Japanese characters are double-byte.
- Multiple encodings: Japanese webpages use 3 encodings: Shift-JIS, EUC-JP, and ISO-2022-JP.
- Options: Multiple vs. Single Catalog
  - Three catalogs: one in Shift-JIS, one in EUC-JP, one in ISO-2022-JP (an awkward and complicated solution to implement), OR
  - One catalog: all catalog data either in one Japanese encoding or in Unicode
20 Single Catalog Options
- A) Convert all data to one Japanese encoding
  - ISO-2022-JP, Shift-JIS, or EUC-JP
  - Handles only Japanese and English
- B) Convert all data to Unicode
- The quick and economical choice. Unicode is . . .
  - A superset of all scripts and character set encodings used on the Web, therefore reusable for other languages
  - More easily implemented into existing code originally written for processing single-byte ASCII
21 The Unicode Plan
- Use Unicode in catalog internal processing
  - Because all electronic text on the Web maps cleanly into Unicode
- Required elements
  - Character encoding conversions
  - Encoding auto-detection
  - Japanese word breaking
22 Encoding Conversion
- Purpose: Convert data between encodings used on the Web and Unicode (which is still not used universally on the Web)
  - From text in Shift-JIS, you want the same text in Unicode
- Functionality provided by Basis Technology's Rosette, embedded in Lycos code as source
  - Rosette is a cross-platform C library for Unicode: http://www.basistech.com/products/
  - Complete set of mapping tables between Unicode and major legacy encodings
  - Conversions performed quickly and economically with minimal impact on performance
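As a rough illustration of the conversion step, the sketch below uses Python's built-in codecs as a stand-in for the mapping tables. It does not use or reproduce the Rosette API; the helper names `to_unicode` and `from_unicode` and the sample string are assumptions made for the example.

```python
def to_unicode(raw: bytes, source_encoding: str) -> str:
    """Decode bytes in a legacy encoding into Unicode text."""
    return raw.decode(source_encoding)


def from_unicode(text: str, target_encoding: str) -> bytes:
    """Encode Unicode text into the requested byte representation."""
    return text.encode(target_encoding)


# Sample Shift-JIS input, produced here only so the example is self-contained.
shift_jis_bytes = "日本".encode("shift_jis")

text = to_unicode(shift_jis_bytes, "shift_jis")
utf8_bytes = from_unicode(text, "utf-8")
ucs2_bytes = from_unicode(text, "utf-16-be")  # same as UCS2 for these characters

print(len(shift_jis_bytes), len(utf8_bytes), len(ucs2_bytes))
```

The conversion is only correct when `source_encoding` is right, which is the motivation for auto-detection.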
23 Encoding Auto-Detection
- Purpose: to correctly identify the encoding of a webpage or query in order to convert properly from one encoding to another.
- Functionality provided by Basis Technology's Rosette
  - Auto-detection on Japanese text in Shift-JIS, EUC-JP, or ISO-2022-JP encodings
  - Enhanced tiebreaker functionality to auto-detect very short strings (queries)
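A naive form of auto-detection can be sketched by trial decoding. This is not how Rosette works (its method is not described in the talk); it is a simplified stand-in that assumes an ESC byte signals ISO-2022-JP and otherwise keeps every candidate that decodes without error. The function name `plausible_encodings` is hypothetical.

```python
def plausible_encodings(raw: bytes):
    """Return the candidate encodings under which raw decodes without error."""
    # ISO-2022-JP is 7-bit and switches character sets with ESC sequences,
    # so an ESC byte is a strong hint that trial decoding alone would miss.
    if b"\x1b" in raw:
        candidates = ["iso2022_jp"]
    else:
        candidates = ["euc_jp", "shift_jis"]
    matches = []
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        matches.append(name)
    return matches


print(plausible_encodings("日本語".encode("shift_jis")))
print(plausible_encodings("日本語".encode("euc_jp")))
print(plausible_encodings("日本語".encode("iso2022_jp")))
# A two-byte string that is valid in both encodings: a tie.
print(plausible_encodings(b"\xb1\xb2"))
```

The last call shows why very short strings such as queries need a tiebreaker: the same two bytes are valid half-width katakana in Shift-JIS and a valid kanji in EUC-JP.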
24 Why Encoding Auto-Detection? (1)
- In order to convert text to another encoding, you have to know where you're starting from. Or you could get . . .
- Ex. Text in EUC-JP when viewed as other encodings:
  - As EUC-JP: readable Japanese
  - As Shift-JIS: garbled characters
  - As ASCII: garbled characters
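The effect can be reproduced with Python's built-in codecs. The sample string below is an assumption chosen for illustration (the original slide's sample text is not recoverable), and undecodable bytes are shown as replacement characters.

```python
# EUC-JP bytes for a sample Japanese string (chosen for this example only).
raw = "日本語".encode("euc_jp")

views = {
    "EUC-JP": raw.decode("euc_jp"),
    "Shift-JIS": raw.decode("shift_jis", errors="replace"),
    "ASCII": raw.decode("ascii", errors="replace"),
}

for label, text in views.items():
    print(f"{label}: {text}")
```

Only the first view is readable; the other two interpret the same bytes under the wrong assumption.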
25 Why Encoding Auto-Detection? (2)
- Very few web publishers announce the character set as allowed by the HTTP and HTML standards. Ex.)
  - HTTP (via web server): Content-type: text/html; charset=Shift_JIS
  - HTML (in the HTML file): <META HTTP-EQUIV="Content-type" CONTENT="text/html; charset=Shift_JIS">
  - <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">
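When a publisher does announce the character set, reading it is straightforward. The sketch below uses the standard library's `email.message.Message` to parse the HTTP header value and a deliberately simple regular expression for the META tag. The helper names are hypothetical, and the regex is an assumption that handles only the common form shown on the slide, not every valid HTML variation.

```python
import re
from email.message import Message

# Simplified pattern: finds charset=... inside a META tag near the top of a page.
META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.IGNORECASE
)


def charset_from_http(content_type: str):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lower-cased by the library


def charset_from_html(html_bytes: bytes):
    """Return the charset announced in a META tag, or None."""
    match = META_CHARSET.search(html_bytes[:4096])
    return match.group(1).decode("ascii").lower() if match else None


print(charset_from_http("text/html; charset=Shift_JIS"))
print(charset_from_html(
    b'<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">'
))
print(charset_from_http("text/html"))  # nothing announced
```

A `None` result is the common case the slide describes, and it is where auto-detection has to take over.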
26 Japanese Word Breaking
- Problem to solve: Japanese words are not delimited by spaces
- Purpose: To return indexable units (words) for creating an index, or for breaking the query into words to look up in the index.
- Solution: Basis Technology's Japanese Morphological Analyzer (http://www.basistech.com/products/)
  - Dictionary-based Japanese word breaking
  - Elimination of stop words (ex. a, the, etc.)
  - Looks for longest word match
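The idea of dictionary-based, longest-match word breaking can be sketched as a greedy scan. This is a toy illustration, not the Basis Technology analyzer: the dictionary, the stop-word list, and the function name `break_words` are all hypothetical, and unknown characters simply fall back to single-character words.

```python
def break_words(text, dictionary, stop_words=frozenset()):
    """Greedy longest-match segmentation with stop-word elimination."""
    max_len = max(map(len, dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                break
        if candidate not in stop_words:
            words.append(candidate)
        i += length
    return words


# Toy dictionary and stop-word list, assumed for this example only.
dictionary = {"東京", "東京都", "に", "住む"}
stop_words = {"に"}

print(break_words("東京都に住む", dictionary, stop_words))
```

Because the longest match wins, the three-character entry is chosen over its two-character prefix, and the particle listed as a stop word is dropped from the indexable units.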
27 Selecting Unicode Representation (1): UCS2 characteristics
- Depending on the task, either the UCS2 or the UTF8 representation of Unicode was used in different parts of the Lycos search
- Characteristics of UCS2
  - Each coded character element is fixed width, 16 bits
  - Data paths must all accommodate 16 bits
  - Text in UCS2 is easy to manipulate and analyze (from a programming viewpoint)
28 Selecting Unicode Representation (2): UTF8 characteristics
- Characteristics of UTF8
  - Each coded character is composed of one to six octets (one octet = 8 bits)
  - Data paths need only be "8-bit clean"
  - None of the octets in a multi-byte character are null (i.e., have the value of zero)
  - Text in UTF8 is difficult to manipulate or analyze.
- "8-bit clean": computer code which treats all 8 bits of a byte as significant. True of any computer code that processes European languages properly, but not necessarily true of code that processes only ASCII, which uses only 7 bits per character.
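The contrast between the two representations can be checked directly. In this sketch Python's `utf-16-be` codec stands in for UCS2, which is equivalent for the characters used here; the sample characters and the `describe` helper are assumptions for illustration. Note that Python follows the current UTF-8 definition, which caps sequences at four octets.

```python
samples = {
    "ASCII letter": "A",
    "Latin letter with diacritic": "é",
    "Japanese character": "日",
}


def describe(ch: str) -> dict:
    """Report size and null-octet presence for one character in each form."""
    ucs2 = ch.encode("utf-16-be")  # fixed 16 bits for these characters
    utf8 = ch.encode("utf-8")      # variable width
    return {
        "ucs2_len": len(ucs2),
        "utf8_len": len(utf8),
        "ucs2_has_null": 0 in ucs2,
        "utf8_has_null": 0 in utf8,
    }


for label, ch in samples.items():
    print(label, describe(ch))
```

The UCS2 form of an ASCII letter contains a zero octet, which is one reason 16-bit text does not pass safely through code that treats a null byte as end-of-string, while UTF8 does.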
29 UCS2, UTF8, ASCII, etc.
- [Diagram: an ASCII character (7 bits), a Latin character with a diacritical (8 bits), and a double-byte Japanese character (in Shift-JIS, EUC-JP, etc.) shown as UTF8 and as UCS2 against an 8-bit clean data pipe; 16-bit UCS2 can't fit.]
30 Unicode in the Lycos System
- UCS2: Japanese Morphological Analyzer from Basis Technology
  - Using UCS2 is the quick and economical way to process huge volumes of Japanese text.
- UTF8: Lycos Catalog
  - Economy of disk space: ASCII is smaller in UTF8. On the Web: ASCII 79%, double-byte Asian less than 5%, European encodings and others 16%
  - Ease of integration with existing code (a.k.a. transmissibility)
- Percentages based on the number of Web hosts on the Internet by country (total number of hosts for English-speaking domains as a percentage of the total number of hosts worldwide). Source: Survey by Network Wizards, http://www.nw.com
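The disk-space argument can be made concrete with a back-of-the-envelope estimate. The sketch below treats the slide's host shares as if they were shares of characters, takes "less than 5%" as 5%, and pessimistically assumes every European character needs two octets in UTF8. All three are assumptions for illustration, not figures from the talk.

```python
def average_octets_per_char(mix, octets_per_char):
    """Weighted average storage cost for a given text mix."""
    return sum(share * octets_per_char[kind] for kind, share in mix.items())


# Shares from the slide, read as character shares (an assumption).
web_mix = {"ascii": 0.79, "asian": 0.05, "european": 0.16}

# Octets per character in each representation (European is a worst case).
utf8_octets = {"ascii": 1, "asian": 3, "european": 2}
ucs2_octets = {"ascii": 2, "asian": 2, "european": 2}

utf8_avg = average_octets_per_char(web_mix, utf8_octets)
ucs2_avg = average_octets_per_char(web_mix, ucs2_octets)

print(f"UTF8: {utf8_avg:.2f} octets/char, UCS2: {ucs2_avg:.2f} octets/char")
```

Even with the pessimistic European figure, the ASCII-heavy mix keeps the UTF8 average well below the fixed two octets of UCS2, although Japanese text on its own is larger in UTF8.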
35 Project Complete: Lycos Japan (1)
- Quick
  - Prototype in two months
  - Beta version of Japanese search debuts July 1998; enters competitive Japanese search engine race in 4th place
  - Upon formal launch, grabs 2nd place in October 1998
  - Lycos Japan moves into 1st place as of June 1999
- Lycos Japan: http://www.lycos.co.jp/
- According to Search Desk, http://www.searchdesk.com/
37 Project Complete: Lycos Japan (2)
- E-conomical: Today, Lycos has spider, catalog, and query software that can easily be set to make catalogs in different languages by swapping in and out localized pieces:
  - Settings for target domains
  - Encoding detection and conversion calls
  - Language-specific word breaker (if needed)
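One way to picture the swappable pieces is a per-language profile that bundles the three settings listed above. This is a hypothetical design sketch, not the Lycos architecture: `LanguageProfile`, `decode_page`, `index_terms`, and the German profile are invented for illustration, and trying encodings in order is a crude substitute for real detection.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class LanguageProfile:
    name: str
    target_domains: Sequence[str]            # where the spider should crawl
    candidate_encodings: Sequence[str]       # tried in order when decoding
    break_words: Callable[[str], List[str]]  # language-specific word breaker


def whitespace_breaker(text: str) -> List[str]:
    """Word breaker for languages that delimit words with spaces."""
    return text.split()


def decode_page(raw: bytes, profile: LanguageProfile) -> str:
    """Convert a page to Unicode using the profile's candidate encodings."""
    for encoding in profile.candidate_encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError(f"no candidate encoding fits for {profile.name}")


def index_terms(raw: bytes, profile: LanguageProfile) -> List[str]:
    """Shared pipeline: only the profile changes between languages."""
    return profile.break_words(decode_page(raw, profile))


# Hypothetical profile; a Japanese one would swap in a dictionary-based breaker.
german = LanguageProfile("German", (".de",), ("utf-8", "iso-8859-1"), whitespace_breaker)

print(index_terms("Straße und Brücke".encode("iso-8859-1"), german))
```

Under this arrangement, adding a language means supplying a new profile while the spider, catalog, and query code stay the same.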
40 Q&A
- Questions?
- tina_at_basistech.com, www.basistech.com
- egardner_at_lycos.com, www.lycos.com
- Thank you!