Destination Japan: Internationalization of the Lycos Search Engine

1
Destination Japan: Internationalization of the
Lycos Search Engine
  • Presented by
  • Eric Gardner of Lycos, Inc.
  • Tina Lieu of Basis Technology Corp.

2
Lycos Search Technology
  • Created at CMU by Professor Fuzzy Mauldin and
    students in 1994-95.
  • Intelligent spidering methods (now patented).
    Spiders crawl the web retrieving documents for
    indexing.
  • Back-end database of webpages (web catalog)
  • Query engine with relevancy algorithms for
    ordering search results.
  • Not internationalized

7
Two Main Functions of a Search Engine
  • Building a catalog
      • Input: webpages
      • Output: inverted word index
  • Performing a query
      • Input: keywords and other search parameters
      • Output: list of matching webpages
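
The two functions above can be sketched in a few lines. This is a minimal illustration, not the Lycos implementation: the page URLs and function names are hypothetical, words are split on whitespace (which works for English but not for Japanese), and relevancy ordering is left out.

```python
from collections import defaultdict


def build_catalog(pages):
    """Build an inverted word index: word -> set of page URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def perform_query(index, keywords):
    """Return the sorted URLs of pages that contain every keyword."""
    postings = [index.get(word.lower(), set()) for word in keywords.split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Hypothetical input webpages (URL -> page text).
pages = {
    "http://example.com/a": "Lycos search engine",
    "http://example.com/b": "Japanese search engine",
}
catalog = build_catalog(pages)
print(perform_query(catalog, "search engine"))
print(perform_query(catalog, "japanese"))
```

The query side only looks words up in the index, so whatever splits a page into words when the catalog is built must also split the query.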

15
Scale of the Task
  • At Lycos we need to design for:
      • 100,000,000 users daily
      • 1,000,000,000 documents
      • 10,000,000,000,000 bytes of data
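
As a rough worked example, the three design figures imply about 10,000 bytes per document and roughly 1,157 users per second on average. The averages are derived here for illustration only; the slides state just the totals, and real traffic would peak well above a daily mean.

```python
# Design figures from the slide.
USERS_PER_DAY = 100_000_000
DOCUMENTS = 1_000_000_000
TOTAL_BYTES = 10_000_000_000_000

SECONDS_PER_DAY = 24 * 60 * 60

# Derived averages (illustrative, not stated in the presentation).
avg_bytes_per_document = TOTAL_BYTES // DOCUMENTS
avg_users_per_second = USERS_PER_DAY / SECONDS_PER_DAY

print(avg_bytes_per_document)
print(round(avg_users_per_second))
```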

16
First Stop: Europe
  • Lycos search technology was initially built for
    ASCII only. In-house work made the data paths
    8-bit clean to accommodate European languages.
  • Otherwise relatively straightforward: components
    such as ad servers, web servers, etc., required
    little if any change.
  • European service came online in May 1997.

17
Next Stop: Japan
  • Business and strategic reasons to introduce
    Japanese search.
  • Not a lot of international(ization) experience
    within Lycos at the time.
  • We needed assistance and chose Basis Technology.

18
Goals
  • Quick deployment of Japanese search
      • 1995 to 1997: Japanese Internet more than
        doubling each year
      • Marketing need to launch in Japan ASAP
  • Economical, extensible solution
      • Produce reusable internationalized code
      • Poise Lycos for even quicker deployment for
        other languages
      • Get "more bang for the buck"

19
Japanese Issues for Catalog
  • Double-byte: Japanese characters are double-byte.
  • Multiple encodings: Japanese webpages use 3
    encodings: Shift-JIS, EUC-JP, and ISO-2022-JP.
  • Options: Multiple vs. Single Catalog
      • Three catalogs: one in Shift-JIS, one in
        EUC-JP, one in ISO-2022-JP (an awkward and
        complicated solution to implement), OR
      • One catalog: all catalog data either in one
        Japanese encoding or in Unicode
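
To make the multiple-encodings issue concrete, the sketch below encodes one sample string in each of the three encodings using Python's standard codecs. The sample text is arbitrary, and the codec names are Python's, not anything from the Lycos code.

```python
TEXT = "日本語"  # arbitrary sample string ("Japanese language")

# Python's codec names for the three encodings named on the slide.
ENCODINGS = ("shift_jis", "euc_jp", "iso2022_jp")

encoded = {name: TEXT.encode(name) for name in ENCODINGS}

for name, raw in encoded.items():
    # The same three characters produce a different byte sequence each time.
    print(f"{name:11} {len(raw):2d} bytes  {raw.hex(' ')}")
```

A catalog that stored these bytes as-is would hold three unrelated index entries for the same word, which is the case for converting everything to a single representation.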

20
Single Catalog Options
  • A) Convert all data to one Japanese encoding
      • ISO-2022-JP, Shift-JIS, or EUC-JP
      • Handles only Japanese and English
  • B) Convert all data to Unicode. The quick and
    economical choice; Unicode is . . .
      • A superset of all scripts and character set
        encodings used on the Web, therefore reusable
        for other languages
      • More easily integrated into existing code
        originally written for processing single-byte
        ASCII

21
The Unicode Plan
  • Use Unicode in catalog internal processing
      • Because all electronic text on the Web maps
        cleanly into Unicode
  • Required elements
      • Character encoding conversions
      • Encoding auto-detection
      • Japanese word breaking

22
Encoding Conversion
  • Purpose: Convert data between the encodings used
    on the Web and Unicode (which is still not used
    universally on the Web)
      • From a character encoded in Shift-JIS, you want
        the same character in Unicode
  • Functionality provided by Basis Technology's
    Rosette, embedded in Lycos code as source
      • Rosette is a cross-platform C library for
        Unicode: http://www.basistech.com/products/
      • Complete set of mapping tables between Unicode
        and major legacy encodings
      • Conversions performed quickly and economically
        with minimal impact on performance
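
The conversion step can be illustrated with Python's built-in codecs, used here as a stand-in for Rosette's mapping tables (Rosette's actual API is not shown in the slides). The function names and the sample word are assumptions made for this sketch.

```python
def to_unicode(raw: bytes, source_encoding: str) -> str:
    """Decode legacy-encoded bytes into a Unicode string."""
    return raw.decode(source_encoding)


def from_unicode(text: str, target_encoding: str) -> bytes:
    """Encode a Unicode string into the requested encoding."""
    return text.encode(target_encoding)


# Sample input: the word "検索" (search) as Shift-JIS bytes.
sjis_bytes = "検索".encode("shift_jis")

text = to_unicode(sjis_bytes, "shift_jis")
print([f"U+{ord(ch):04X}" for ch in text])  # Unicode code points
print(from_unicode(text, "utf-8").hex(" "))  # the same text as UTF8 octets
```

Going through Unicode means each legacy encoding needs only one pair of mapping tables, rather than a table for every pair of encodings.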

23
Encoding Auto-Detection
  • Purpose: To correctly identify the encoding of a
    webpage or query in order to convert properly
    from one encoding to another.
  • Functionality provided by Basis Technology's
    Rosette
      • Auto-detection on Japanese text in Shift-JIS,
        EUC-JP, or ISO-2022-JP encodings
      • Enhanced tiebreaker functionality to auto-detect
        very short strings (queries)
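
A naive way to see why detection needs a tiebreaker is to test which encodings a byte string is valid in. The sketch below is a simple validity check built on Python's codecs; it is not Rosette's algorithm, and the function name is hypothetical.

```python
from typing import List


def candidate_encodings(raw: bytes) -> List[str]:
    """Return every encoding under which `raw` is a valid byte sequence."""
    if all(byte < 0x80 for byte in raw):
        # ISO-2022-JP is 7-bit and switches character sets with ESC sequences.
        return ["iso2022_jp"] if b"\x1b" in raw else ["ascii"]
    matches = []
    for name in ("euc_jp", "shift_jis"):
        try:
            raw.decode(name)  # strict decoding raises on invalid bytes
        except UnicodeDecodeError:
            continue
        matches.append(name)
    return matches


print(candidate_encodings("日本語".encode("shift_jis")))
print(candidate_encodings("日本語".encode("euc_jp")))
print(candidate_encodings("あ".encode("euc_jp")))  # a very short query
```

The one-character query is valid in both EUC-JP and Shift-JIS (where its bytes read as half-width katakana), so validity alone cannot decide. Short strings like this are where some form of tiebreaking becomes necessary.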

24
Why Encoding Auto-Detection? (1)
  • In order to convert text to another encoding, you
    have to know where you're starting from. Or you
    could get . . .
  • Ex. Text in EUC-JP when viewed as other
    encodings:
      • EUC-JP: ?? ?????? ??
      • Shift-JIS: ??? ?????????? ????
      • ASCII:
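
The effect described above can be reproduced with Python's codecs. The sample phrase is an assumption chosen for this sketch, since the slide's original example text did not survive in the transcript.

```python
# Sample phrase ("Japanese-language search"), stored as EUC-JP bytes.
euc_bytes = "日本語の検索".encode("euc_jp")

as_euc = euc_bytes.decode("euc_jp")                        # correct guess
as_sjis = euc_bytes.decode("shift_jis", errors="replace")  # wrong guess
as_ascii = euc_bytes.decode("ascii", errors="replace")     # wrong guess

print("EUC-JP:   ", as_euc)
print("Shift-JIS:", as_sjis)
print("ASCII:    ", as_ascii)
```

Only the first line is readable. The other two show what a converter would feed into the catalog if it started from the wrong encoding.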

25
Why Encoding Auto-Detection? (2)
  • Very few web publishers announce the character
    set as allowed by the HTTP and HTML standards. Ex.)
      • HTTP (via web server):
        Content-Type: text/html; charset=Shift_JIS
      • HTML (in the HTML file):
        <META HTTP-EQUIV="Content-Type"
        CONTENT="text/html; charset=Shift_JIS">
        <META HTTP-EQUIV="Content-Type"
        CONTENT="text/html; charset=ISO-8859-1">
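
When a page does announce its character set, the declaration can be read before falling back to auto-detection. The sketch below uses a simple regular expression; the function name is hypothetical, and a production crawler would need a more tolerant parser.

```python
import re
from typing import Optional

_CHARSET = r"charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)"
_HEADER_RE = re.compile(_CHARSET, re.IGNORECASE)
_META_RE = re.compile(r"<meta[^>]+" + _CHARSET, re.IGNORECASE)


def declared_charset(content_type: Optional[str], html: bytes) -> Optional[str]:
    """Return the announced charset, preferring the HTTP header over <META>."""
    if content_type:
        match = _HEADER_RE.search(content_type)
        if match:
            return match.group(1)
    # Latin-1 maps every byte to a character, so the ASCII markup stays
    # searchable before the page's real encoding is known.
    match = _META_RE.search(html.decode("latin-1"))
    return match.group(1) if match else None


page = b'<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=Shift_JIS">'
print(declared_charset(None, page))
print(declared_charset("text/html", b"<html><body>no declaration</body></html>"))
```

A result of None is the common case the slide describes, and it is the point at which auto-detection has to take over.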

26
Japanese Word Breaking
  • Problem to solve: Japanese words are not
    delimited by spaces
  • Purpose: To return indexable units (words) for
    creating an index, or for breaking the query into
    words to look up in the index.
  • Solution: Basis Technology's Japanese
    Morphological Analyzer
    (http://www.basistech.com/products/)
      • Dictionary-based Japanese word breaking
      • Elimination of stop words (ex. a, the, etc.)
      • Looks for longest word match
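
The longest-match idea can be shown with a toy word breaker. This is a greedy left-to-right sketch with a five-word dictionary invented for the example. It is not Basis Technology's analyzer, which performs real morphological analysis.

```python
def break_words(text, dictionary, stop_words=frozenset()):
    """Greedy longest-match word breaking against a dictionary."""
    max_len = max((len(word) for word in dictionary), default=1)
    words = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first; fall back to a single character.
        for length in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + length]
            if length == 1 or candidate in dictionary:
                break
        pos += len(candidate)
        if candidate not in stop_words:
            words.append(candidate)
    return words


# Toy dictionary and stop-word list (assumptions for this example).
DICTIONARY = {"東京", "東京都", "京都", "に", "住む"}
STOP_WORDS = frozenset({"に"})

print(break_words("東京都に住む", DICTIONARY, STOP_WORDS))
```

Here the breaker prefers 東京都 (Tokyo Metropolis) over the shorter 東京 (Tokyo) and drops the particle に as a stop word. Indexing and querying must use the same breaker so that both sides produce the same units.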

27
Selecting Unicode Representation (1): UCS2
characteristics
  • Depending on the task, either the UCS2 or the UTF8
    representation of Unicode was used in different
    parts of the Lycos search
  • Characteristics of UCS2
      • Each coded character element is fixed width, 16
        bits
      • Data paths must all accommodate 16 bits
      • Text in UCS2 is easy to manipulate and analyze
        (from a programming viewpoint)

28
Selecting Unicode Representation (2): UTF8
characteristics
  • Characteristics of UTF8
      • Each coded character is composed of one to six
        octets (one octet = 8 bits)
      • Data paths need only be "8-bit clean"
      • None of the octets in a multi-byte character is
        null (i.e., has the value zero)
      • Text in UTF8 is difficult to manipulate or
        analyze.
  • "8-bit clean": computer code which treats all 8
    bits of a byte as significant. True of any
    computer code that processes European languages
    properly, but not necessarily true of code that
    processes only ASCII, which uses only 7 bits per
    character.

29
UCS2, UTF8, ASCII, etc.
  • [Diagram: an ASCII character (7 bits), a Latin
    character with diacritical (8 bits), and a Japanese
    double-byte character (in Shift-JIS, EUC-JP, etc.),
    each shown as UTF8 and as UCS2, entering an 8-bit
    clean data pipe. 16-bit UCS2 can't fit.]

30
Unicode in the Lycos System
  • UCS2: Japanese Morphological Analyzer from Basis
    Technology
      • Using UCS2 is the quick and economical way to
        process huge volumes of Japanese text.
  • UTF8: Lycos Catalog
      • Economy of disk space: ASCII is smaller in
        UTF8. On the Web*: ASCII 79%, double-byte Asian
        less than 5%, European encodings and others 16%
      • Ease of integration with existing code (a.k.a.
        transmissibility)
  • * Based on the number of Web hosts on the Internet
    by country (total number of hosts for
    English-speaking domains as a percentage of the
    total number of hosts worldwide). Source: Survey
    by Network Wizards, http://www.nw.com
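
The disk-space argument can be worked through as a weighted average. This is a rough estimate under loud assumptions: the host shares from the slide are treated as character shares, "less than 5%" is taken as 5%, every Asian character is counted as 3 UTF8 bytes, and every character in the European/other share as 2 UTF8 bytes (a pessimistic choice, since much European text is plain ASCII).

```python
# Shares from the slide, read here as fractions of all characters (assumption).
SHARE = {"ascii": 0.79, "asian": 0.05, "european_other": 0.16}

# Bytes per character in each representation (assumptions noted above).
UTF8_BYTES = {"ascii": 1, "asian": 3, "european_other": 2}
UCS2_BYTES = {"ascii": 2, "asian": 2, "european_other": 2}


def average_bytes(bytes_per_char):
    """Weighted average of bytes per character over the three shares."""
    return sum(SHARE[kind] * bytes_per_char[kind] for kind in SHARE)


utf8_avg = average_bytes(UTF8_BYTES)
ucs2_avg = average_bytes(UCS2_BYTES)
print(f"UTF8: {utf8_avg:.2f} bytes/char, UCS2: {ucs2_avg:.2f} bytes/char")
```

Under these assumptions UTF8 averages about 1.26 bytes per character against 2 for UCS2, so a catalog dominated by ASCII text is substantially smaller in UTF8 even though Japanese characters themselves take more room.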

35
Project Complete: Lycos Japan (1)
  • Quick
      • Prototype in two months
      • Beta version of Japanese search debuts July
        1998; enters competitive Japanese search engine
        race in 4th place
      • Upon formal launch, grabs 2nd place in October
        1998
      • Lycos Japan moves into 1st place as of June 1999
  • Lycos Japan: http://www.lycos.co.jp/
  • According to Search Desk,
    http://www.searchdesk.com/

37
Project Complete: Lycos Japan (2)
  • E-conomical: Today, Lycos has spider, catalog, and
    query software that can easily be set up to build
    catalogs in different languages by swapping
    localized pieces in and out:
      • Settings for target domains
      • Encoding detection and conversion calls
      • Language-specific word breaker (if needed)
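
One way to picture the swappable pieces is a per-language profile that the shared software consults. Everything in this sketch is hypothetical (class name, fields, domains, and the placeholder Japanese breaker that emits one term per character); the slides do not describe how Lycos actually structured its configuration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class LanguageProfile:
    """The localized pieces for one language; the rest of the code is shared."""
    name: str
    target_domains: Tuple[str, ...]
    candidate_encodings: Tuple[str, ...]
    word_breaker: Optional[Callable[[str], List[str]]] = None

    def index_terms(self, raw: bytes, encoding: str) -> List[str]:
        """Convert a page to Unicode, then split it into indexable terms."""
        text = raw.decode(encoding)
        breaker = self.word_breaker or str.split  # whitespace if none is set
        return breaker(text)


JAPANESE = LanguageProfile(
    name="Japanese",
    target_domains=(".jp",),
    candidate_encodings=("shift_jis", "euc_jp", "iso2022_jp"),
    word_breaker=lambda text: list(text),  # placeholder, not a real analyzer
)
GERMAN = LanguageProfile(
    name="German",
    target_domains=(".de",),
    candidate_encodings=("iso8859_1",),
)

print(GERMAN.index_terms("Suche im Netz".encode("iso8859_1"), "iso8859_1"))
print(JAPANESE.index_terms("検索".encode("euc_jp"), "euc_jp"))
```

Adding a language then amounts to defining another profile, which matches the slide's point that a word breaker is needed only for languages that do not delimit words with spaces.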

38
Q&A

40
Q&A
  • Questions?
  • tina_at_basistech.com   www.basistech.com
  • egardner_at_lycos.com   www.lycos.com
  • Thank you!