Destination Japan: Internationalization of the Lycos Search Engine

1
Destination Japan: Internationalization of the Lycos Search Engine

Presented by
Eric Gardner of Lycos, Inc.
Tina Lieu of Basis Technology Corp.

2
Lycos Search Technology

Created at CMU by Professor Fuzzy Mauldin and students in 1994-95.
Intelligent spidering methods (now patented).
Spiders crawl the web retrieving documents for
indexing.
Back-end database of webpages (web catalog)
Query engine with relevancy algorithms for
ordering search results.
Not internationalized

7
Two Main Functions of a Search Engine

Building a catalog
Input: webpages
Output: inverted word index
Performing a query
Input: keywords and other search parameters
Output: list of matching webpages
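
The two functions on this slide can be sketched in a few lines. This is a toy illustration only, not the Lycos implementation: the names `build_catalog` and `query`, the example URLs, and the whitespace word splitting are all assumptions made for the example.

```python
from collections import defaultdict

def build_catalog(pages):
    """Input: webpages (url -> text). Output: inverted word index."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def query(index, keywords):
    """Input: keywords. Output: sorted list of pages containing all of them."""
    matches = [index.get(word.lower(), set()) for word in keywords]
    if not matches:
        return []
    return sorted(set.intersection(*matches))

# Hypothetical pages, for illustration only.
pages = {
    "http://example.com/a": "Lycos search engine",
    "http://example.com/b": "Japanese search",
}
index = build_catalog(pages)
print(query(index, ["search"]))
print(query(index, ["japanese", "search"]))
```

Splitting on whitespace works for English but not for Japanese, which is why word breaking appears later as a required element.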

15
Scale of the Task

At Lycos we need to design for
100,000,000 users daily
1,000,000,000 documents
10,000,000,000,000 bytes of data

16
First Stop Europe

Lycos search technology initially for ASCII only.
In-house work to make data paths 8-bit clean, to
accommodate European languages.
Otherwise relatively straightforward: components
such as ad servers, web servers, etc., require
little if any change.
European service came online in May 1997.

17
Next Stop Japan

Business and strategic reasons to introduce
Japanese search.
Not a lot of international(ization) experience
within Lycos at the time.
We needed assistance and chose Basis Technology.

18
Goals

Quick deployment of Japanese search
1995 to 1997, Japanese Internet more than
doubling each year
Marketing need to launch in Japan ASAP
Economical, extensible solution
Produce reusable internationalized code
Poise Lycos for even quicker deployment for other
languages
Get "more bang for the buck"

19
Japanese Issues for Catalog

Double-byte: Japanese characters are double-byte.
Multiple encodings: Japanese webpages use 3
encodings: Shift-JIS, EUC-JP, and ISO-2022-JP.
Options: Multiple vs. Single Catalog
Three catalogs: one in Shift-JIS, one in EUC-JP,
one in ISO-2022-JP (an awkward and complicated
solution to implement)
OR
One catalog: all catalog data either in one
Japanese encoding or in Unicode

20
Single Catalog Options

A) Convert all data to one Japanese encoding:
ISO-2022-JP, Shift-JIS, or EUC-JP
Handles only Japanese and English
B) Convert all data to Unicode
The quick and economical choice. Unicode is . . .
A superset of all scripts and character set
encodings used on the Web, therefore reusable for
other languages
More easily integrated into existing code
originally written for processing single-byte
ASCII

21
The Unicode Plan

Use Unicode in catalog internal processing
Because all electronic text on the Web maps
cleanly into Unicode
Required elements:
Character encoding conversions
Encoding auto-detection
Japanese word breaking

22
Encoding Conversion

Purpose: Convert data between encodings used on
the Web and Unicode (which is still not used
universally on the Web)
From ?? in Shift-JIS you want ?? in Unicode
Functionality provided by Basis Technology's
Rosette, embedded in Lycos code as source
Rosette is a cross-platform C library for
Unicode: http://www.basistech.com/products/
Complete set of mapping tables between Unicode
and major legacy encodings
Conversions performed quickly and economically
with minimal impact on performance
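
As a rough illustration of what such a conversion does, the sketch below uses Python's built-in codecs instead of Rosette (whose API is not shown in these slides). The helper names `to_unicode` and `from_unicode` and the sample text are assumptions made for the example.

```python
def to_unicode(raw: bytes, encoding: str) -> str:
    """Convert bytes in a legacy encoding to Unicode text."""
    return raw.decode(encoding)

def from_unicode(text: str, encoding: str) -> bytes:
    """Convert Unicode text to bytes in a legacy encoding."""
    return text.encode(encoding)

text = "日本語"  # sample text, chosen for illustration

# The same three characters have different bytes in each Web encoding.
sjis = from_unicode(text, "shift_jis")
euc = from_unicode(text, "euc_jp")
jis = from_unicode(text, "iso2022_jp")

for name, raw in [("Shift-JIS", sjis), ("EUC-JP", euc), ("ISO-2022-JP", jis)]:
    # Every encoding maps back to the same Unicode text.
    print(name, raw.hex(" "), "->", to_unicode(raw, name.lower().replace("-", "_")
                                                 if name != "ISO-2022-JP" else "iso2022_jp"))
```

Once everything is in Unicode, the catalog only has to handle one representation no matter which encoding the webpage used.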

23
Encoding Auto-Detection

Purpose: to correctly identify the encoding of a
webpage or query in order to convert properly
from one encoding to another.
Functionality provided by Basis Technology's
Rosette
Auto-detection on Japanese text in Shift-JIS,
EUC-JP, or ISO-2022-JP encodings
Enhanced tiebreaker functionality to auto-detect
very short strings (queries)
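
A minimal sketch of the idea, assuming a simple trial-decoding strategy. This is not Rosette's detection algorithm, which these slides do not describe; the function name `detect_japanese_encoding` and the fixed candidate order (a crude stand-in for a tiebreaker) are assumptions.

```python
from typing import Optional

def detect_japanese_encoding(raw: bytes) -> Optional[str]:
    """Return the first candidate encoding in which raw decodes cleanly."""
    candidates = []
    # ISO-2022-JP is 7-bit and switches character sets with ESC sequences.
    if b"\x1b" in raw:
        candidates.append("iso2022_jp")
    # Assumed order: when several candidates decode, the earliest wins.
    candidates += ["ascii", "euc_jp", "shift_jis"]
    for encoding in candidates:
        try:
            raw.decode(encoding)  # strict decoding raises on invalid bytes
        except UnicodeDecodeError:
            continue
        return encoding
    return None

text = "日本語"  # sample text, chosen for illustration
for encoding in ("shift_jis", "euc_jp", "iso2022_jp"):
    print(encoding, "->", detect_japanese_encoding(text.encode(encoding)))
```

Trial decoding gets weaker as strings get shorter, because a few bytes can be valid in more than one encoding. That is the case the enhanced tiebreaker for queries is meant to address.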

24
Why Encoding Auto-Detection? (1)

In order to convert text to another encoding, you
have to know where you're starting from. Or you
could get . . .
Ex. Text in EUC-JP when viewed as other
encodings:
EUC-JP ?? ?????? ??
Shift-JIS ??? ?????????? ????
ASCII
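
The effect can be reproduced with Python's built-in codecs. The sample text below is an assumption chosen for the example, not the text from the original slide.

```python
original = "日本語"  # sample text, chosen for illustration
euc_bytes = original.encode("euc_jp")

# Correct starting point: the text survives.
as_euc = euc_bytes.decode("euc_jp")

# Wrong starting points: the same bytes turn into garbage.
# errors="replace" substitutes U+FFFD for bytes the decoder cannot map.
as_sjis = euc_bytes.decode("shift_jis", errors="replace")
as_ascii = euc_bytes.decode("ascii", errors="replace")

print("EUC-JP   :", as_euc)
print("Shift-JIS:", as_sjis)
print("ASCII    :", as_ascii)
```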

25
Why Encoding Auto-Detection? (2)

Very few web publishers announce the character set,
as allowed by the HTTP and HTML standards. Ex.)
HTTP (via web server): Content-type: text/html; charset=Shift_JIS
HTML (in the HTML file): <META HTTP-EQUIV="Content-type" CONTENT="text/html; charset=Shift_JIS">
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">
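
When a publisher does announce the character set, it can be read before falling back to auto-detection. The sketch below uses only the Python standard library; the names `charset_from_header`, `MetaCharsetParser`, and `charset_from_meta` are hypothetical helpers written for this example.

```python
from email.message import Message
from html.parser import HTMLParser
from typing import Optional

def charset_from_header(value: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value."""
    message = Message()
    message["Content-Type"] = value
    return message.get_content_charset()  # lowercased, or None if absent

class MetaCharsetParser(HTMLParser):
    """Find the first <META HTTP-EQUIV="Content-Type"> declaration."""
    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)  # names arrive lowercased
        content = attributes.get("content")
        if (attributes.get("http-equiv") or "").lower() == "content-type" and content:
            self.charset = charset_from_header(content)

def charset_from_meta(html: str) -> Optional[str]:
    parser = MetaCharsetParser()
    parser.feed(html)
    return parser.charset

print(charset_from_header("text/html; charset=Shift_JIS"))
print(charset_from_meta(
    '<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">'))
print(charset_from_meta("<html><body>no declaration</body></html>"))
```

A result of `None` is the common case the slide describes, and it is where auto-detection has to take over.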

26
Japanese Word Breaking

Problem to solve: Japanese words are not
delimited by spaces
Purpose: To return indexable units (words) for
creating an index, or for breaking the query into
words to look up in the index.
Solution: Basis Technology's Japanese
Morphological Analyzer (http://www.basistech.com/products/)
Dictionary-based Japanese word breaking
Elimination of stop words (ex. a, the, etc.)
Looks for longest word match
27
Selecting Unicode Representation (1): UCS2
characteristics

Different parts of the Lycos search system used
either the UCS2 or the UTF8 representation of
Unicode, depending on the task
Characteristics of UCS2
Each coded character element is fixed width, 16
bits
Data paths must all accommodate 16 bits
Text in UCS2 is easy to manipulate and analyze
(from a programming viewpoint)

28
Selecting Unicode Representation (2): UTF8
characteristics

Characteristics of UTF8
Each coded character is composed of one to six
octets (one octet = 8 bits)
Data paths need only be "8-bit clean"
None of the octets in a multi-byte character is
null (i.e., has the value zero)
Text in UTF8 is difficult to manipulate or
analyze.
"8-bit clean": computer code which treats all 8
bits of a byte as significant. True of any
computer code that processes European languages
properly, but not necessarily true of code that
processes only ASCII, which uses only 7 bits per
character.
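
These properties are easy to check with Python's built-in UTF-8 codec. The sample characters and the `strip_high_bit` helper, which imitates a data path that is not 8-bit clean, are assumptions made for the example.

```python
samples = {
    "ASCII": "a",
    "Latin with diacritical": "é",
    "Japanese": "日",
}

# Variable width: the number of octets depends on the character.
lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}
print(lengths)

utf8 = "café 日本".encode("utf-8")

# No octet of a multi-byte character is zero, so null-terminated
# string handling in existing code keeps working.
print(b"\x00" in utf8)

def strip_high_bit(data: bytes) -> bytes:
    """Imitate a 7-bit data path that discards the top bit of each byte."""
    return bytes(byte & 0x7F for byte in data)

# ASCII survives a 7-bit path; anything else is silently corrupted.
print(strip_high_bit(b"cafe") == b"cafe")
print(strip_high_bit(utf8) == utf8)
```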

29
UCS2, UTF8, ASCII, etc.

(Diagram: ASCII (7 bits), a Latin character with a diacritical (8 bits), and a double-byte Japanese character (in Shift-JIS, EUC-JP, etc.), each shown as UTF8 and as UCS2 against an 8-bit clean data pipe. 16-bit UCS2 can't fit.)

30
Unicode in the Lycos System

UCS2: Japanese Morphological Analyzer from Basis
Technology
Using UCS2 is the quick and economical way to
process huge volumes of Japanese text.
UTF8: Lycos Catalog
Economy of disk space: ASCII is smaller in UTF8.
On the Web: ASCII 79%, double-byte Asian
less than 5%, European encodings and others 16%
Ease of integration with existing code (a.k.a.
transmissibility)
Based on the number of Web hosts on the Internet
by country (total number of hosts for
English-speaking domains as a percentage of the
total number of hosts worldwide). Source: Survey
by Network Wizards, http://www.nw.com
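
A back-of-the-envelope estimate of the disk-space argument, under loudly stated assumptions: the host shares from the slide are treated as if they were shares of characters, "less than 5%" is taken as 5%, Japanese characters are counted at 3 octets in UTF8, and all European text is pessimistically counted at 2 octets per character. None of these simplifications come from the original presentation.

```python
# Shares from the slide, reinterpreted as character shares (an assumption).
share = {"ascii": 0.79, "asian_double_byte": 0.05, "european_and_other": 0.16}

# Assumed octets per character in UTF8 for each group.
utf8_octets = {"ascii": 1, "asian_double_byte": 3, "european_and_other": 2}

# UCS2 always uses 2 octets per character.
UCS2_OCTETS = 2

avg_utf8 = sum(share[group] * utf8_octets[group] for group in share)
avg_ucs2 = float(UCS2_OCTETS)

print(f"UTF8: {avg_utf8:.2f} octets per character on average")
print(f"UCS2: {avg_ucs2:.2f} octets per character on average")
```

Even with the pessimistic figure for European text, the mostly ASCII mix keeps the UTF8 average well under two octets per character.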

35
Project Complete: Lycos Japan (1)

Quick
Prototype in two months
Beta version of Japanese search debuts July 1998;
enters competitive Japanese search engine race in
4th place
Upon formal launch, grabs 2nd place in October
1998
Lycos Japan moves into 1st place as of June 1999
Lycos Japan: http://www.lycos.co.jp/
According to Search Desk, http://www.searchdesk.com/

37
Project Complete: Lycos Japan (2)

E-conomical
Today, Lycos has spider, catalog, and query
software that can easily be set to build
catalogs in different languages by swapping
localized pieces in and out:
Settings for target domains
Encoding detection and conversion calls
Language-specific word breaker (if needed)
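
One way to picture the swappable pieces is a per-language profile handed to otherwise unchanged code. This is a hypothetical sketch of that design, not the Lycos architecture: `LanguageProfile`, `index_terms`, the domain suffixes, and the toy character-by-character breaker are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class LanguageProfile:
    target_domains: List[str]   # settings for target domains
    encodings: List[str]        # encodings to try, in order
    word_breaker: Optional[Callable[[str], List[str]]] = None  # if needed

def index_terms(raw: bytes, profile: LanguageProfile) -> List[str]:
    """Language-neutral pipeline: detect and convert, then break words."""
    text = None
    for encoding in profile.encodings:
        try:
            text = raw.decode(encoding)
            break
        except UnicodeDecodeError:
            continue
    if text is None:
        raise ValueError("no configured encoding fits this document")
    # Languages delimited by spaces need no special breaker.
    breaker = profile.word_breaker or str.split
    return breaker(text)

english = LanguageProfile([".com"], ["ascii", "latin_1"])
# Toy breaker: one term per character, standing in for a real analyzer.
japanese = LanguageProfile([".jp"], ["euc_jp", "shift_jis"], word_breaker=list)

print(index_terms(b"lycos search", english))
print(index_terms("日本".encode("shift_jis"), japanese))
```

Adding another language then means supplying a new profile, not touching the spider, catalog, or query code.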

38
Q&A

39
Q&A
Questions?
tina_at_basistech.com, www.basistech.com
egardner_at_lycos.com, www.lycos.com

40
Thank you!
