Destination Japan: Internationalization of the Lycos Search Engine

1
Destination Japan: Internationalization of the
Lycos Search Engine
  • Presented by
  • Jeff Vander Clute of Lycos, Inc.
  • Tina Lieu of Basis Technology Corp.

2
Lycos is...
  • A new generation Web company
  • - 4 top-20 Web properties in the network
  • - Lycos, Tripod, Angelfire, HotBot
  • A "hub"
  • Search Engine & Navigation
  • - Patented search directory technology
  • Community & Communication
  • E-commerce, Content Aggregation, Etc.

3
The Search Technology
  • Created by a CMU professor (Fuzzy Mauldin)
    and students in 1994/95.
  • 1. Intelligent spidering methods (now
    patented), but not internationalized. Spiders
    crawl the web retrieving documents for indexing.
  • 2. Back-end database of webpages, or catalog,
    plus relevancy algorithms for ordering
    search results.

4
(No Transcript)
5
(No Transcript)
6
(No Transcript)
7
First Stop Europe
  • Lycos search technology initially for ASCII only.
    In-house work to make data paths 8-bit clean, to
    accommodate European languages.
  • Otherwise relatively straightforward. Components
    such as ad servers, Web servers, etc., require
    little if any changes.
  • Euro service came online in May 1997.
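
To illustrate what "8-bit clean" means in practice, here is a minimal sketch of what a 7-bit data path does to Latin-1 text. The function `strip_high_bit` is a hypothetical stand-in for legacy ASCII-only code; it is not taken from the Lycos code base.

```python
def strip_high_bit(data: bytes) -> bytes:
    """Simulate a 7-bit data path by clearing the top bit of every byte."""
    return bytes(b & 0x7F for b in data)


original = "café".encode("latin-1")   # "é" is the single byte 0xE9
damaged = strip_high_bit(original)    # 0xE9 & 0x7F == 0x69, the letter "i"

print(original.decode("latin-1"))     # café
print(damaged.decode("latin-1"))      # cafi
```

Plain ASCII passes through such a path unchanged, which is why the problem only shows up once European text enters the system.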

8
What's Unicode? Where's Japan?
  • The more interesting problem.
  • Business reasons to introduce Japanese search.
  • But not a lot of international(ization)
    experience within Lycos at the time.
  • We needed assistance and chose Basis Technology.

9
Goals
  • Quick deployment of Japanese search
  • 1995 to 1997, Japanese Internet more than
    doubling each year
  • Marketing need to launch in Japan ASAP
  • Economical and efficient solution
  • Produce reusable internationalized code
  • Poise Lycos for even quicker deployment into
    other languages
  • Get "more bang for the buck"

10
Two Main Functions of a Search Engine
  • Building a catalog: Compiling an indexed catalog
    of webpages from the Internet
  • Performing a query: Delivering a list of webpages
    matching certain keywords and parameters input by
    the user
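
The two functions can be sketched with a toy inverted index. This is only an illustration under simple assumptions: the names `build_catalog` and `run_query` and the example URLs are hypothetical, words are split on whitespace (which works for English but not Japanese), and no relevancy ranking is attempted.

```python
from collections import defaultdict


def build_catalog(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map each lowercase word to the set of URLs whose text contains it."""
    catalog: dict[str, set[str]] = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            catalog[word].add(url)
    return catalog


def run_query(catalog: dict[str, set[str]], query: str) -> set[str]:
    """Return the URLs that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(catalog.get(words[0], set()))
    for word in words[1:]:
        results &= catalog.get(word, set())
    return results


pages = {
    "http://example.com/a": "Lycos search engine",
    "http://example.com/b": "Japanese search",
}
catalog = build_catalog(pages)
print(run_query(catalog, "japanese search"))
```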

11
(No Transcript)
12
(No Transcript)
13
(No Transcript)
14
(No Transcript)
15
Japanese Issues for Catalog
  • Double-byte: Japanese characters are double-byte.
  • Multiple encodings: Japanese webpages use 3
    encodings: Shift-JIS, EUC-JP, and ISO-2022-JP.
  • Options: Multiple vs. Single Catalog
  • Three catalogs: one in Shift-JIS, one in EUC-JP,
    one in ISO-2022-JP (an awkward and complicated
    solution to implement), OR
  • One catalog: all catalog data either in one
    Japanese encoding or in Unicode

16
Single Catalog Options
  • A) Convert all data to one Japanese encoding
  • ISO-2022-JP, Shift-JIS, or EUC-JP
  • B) Convert all data to Unicode
  • The quick and economical choice, Unicode is . . .
  • A superset of all scripts and character set
    encodings used on the Web, therefore reusable for
    other languages
  • More easily implemented into existing code
    originally written for processing single-byte
    ASCII

17
The Unicode Plan
  • Use Unicode in catalog internal processing
  • Because all electronic text on the Web maps
    cleanly into Unicode
  • Required elements
  • Character encoding conversions between Unicode
    and webpage encodings (webpage encodings:
    Shift-JIS, ISO-2022-JP, and EUC-JP)
  • Encoding auto-detection
  • Japanese word breaking

18
Encoding Conversion
  • Purpose: Convert data between encodings used on
    the Web and Unicode (which is still not used
    universally on the Web)
  • From ?? in Shift-JIS you want ?? in Unicode
  • Functionality provided by Basis Technology's
    Rosette, embedded in Lycos code as source
  • Rosette is a cross-platform C library for
    Unicode: http://www.basistech.com/products/
  • Complete set of mapping tables between Unicode
    and major legacy encodings
  • Conversions performed quickly and economically
    with minimal impact on performance
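
As a rough illustration of this kind of conversion, the sketch below uses Python's built-in codecs rather than Rosette, whose API is not shown in these slides. The helper name `transcode` is hypothetical.

```python
def transcode(data: bytes, source_encoding: str, target_encoding: str = "utf-8") -> bytes:
    """Decode bytes from a legacy encoding into Unicode, then re-encode them."""
    text = data.decode(source_encoding)   # legacy bytes -> Unicode string
    return text.encode(target_encoding)   # Unicode string -> target bytes


sample = "日本語"
for encoding in ("shift_jis", "euc_jp", "iso2022_jp"):
    legacy_bytes = sample.encode(encoding)
    # The same text has a different byte sequence in each legacy encoding,
    # but all of them convert to the same UTF-8 bytes.
    print(encoding, legacy_bytes.hex(" "), "->", transcode(legacy_bytes, encoding).hex(" "))
```

The conversion only works when `source_encoding` is correct, which is the reason the next slides turn to auto-detection.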

19
Why Encoding Auto-Detection?
  • In order to convert text to another encoding, you
    have to know where you're starting from. Or you
    could get . . .
  • Ex. Text in EUC-JP when viewed as other
    encodings:
  • EUC-JP: ?? ?????? ??
  • Shift-JIS: ??? ?????????? ????
  • ASCII
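
The effect can be reproduced with Python's built-in codecs. This is a sketch with a sample string of my own choosing, not the text from the original slide; `view_as` is a hypothetical helper, and undecodable bytes are shown as the Unicode replacement character.

```python
def view_as(data: bytes, encoding: str) -> str:
    """Decode bytes under an assumed encoding, marking invalid bytes with U+FFFD."""
    return data.decode(encoding, errors="replace")


text = "日本語"
euc_bytes = text.encode("euc_jp")

print("EUC-JP:   ", view_as(euc_bytes, "euc_jp"))     # the original text
print("Shift-JIS:", view_as(euc_bytes, "shift_jis"))  # garbage characters
print("ASCII:    ", view_as(euc_bytes, "ascii"))      # nothing readable
```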

20
Encoding Auto-Detection
  • Purpose: To correctly identify encoding of
    webpage or query in order to convert properly
    from one encoding to another.
  • Functionality provided by Basis Technology's
    Rosette
  • Auto-detection on Japanese text in Shift-JIS,
    EUC-JP, or ISO-2022-JP encodings
  • Enhanced tiebreaker functionality to auto-detect
    very short strings (queries)
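
One naive way to approach detection is trial decoding: keep every candidate encoding under which the bytes decode without error. The sketch below assumes that approach and uses Python's built-in codecs; it is not Rosette's algorithm, and `plausible_encodings` is a hypothetical name.

```python
def plausible_encodings(data: bytes) -> list[str]:
    """Return the candidate encodings under which data decodes cleanly."""
    if all(byte < 0x80 for byte in data):
        # ISO-2022-JP is a 7-bit encoding that switches character sets
        # with escape sequences, so look for ESC before assuming ASCII.
        if b"\x1b" in data:
            try:
                data.decode("iso2022_jp")
                return ["iso2022_jp"]
            except UnicodeDecodeError:
                pass
        return ["ascii"]
    matches = []
    for encoding in ("euc_jp", "shift_jis"):
        try:
            data.decode(encoding)
        except UnicodeDecodeError:
            continue
        matches.append(encoding)
    return matches


print(plausible_encodings("日本語".encode("shift_jis")))
print(plausible_encodings("ここ".encode("euc_jp")))
```

Short inputs can be valid in more than one encoding. The EUC-JP bytes for "ここ", for example, also form legal half-width katakana in Shift-JIS, so both candidates come back. That ambiguity is the kind of case a tiebreaker for short query strings has to resolve.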

21
Japanese Word Breaking
  • Purpose: To return indexable units (words) for
    creating an index, or for breaking the query into
    words to look up in the index.
  • Problem: Japanese words are not delimited by
    spaces
  • Solution: Basis Technology's Japanese
    Morphological Analyzer
    (http://www.basistech.com/products/)
  • Dictionary-based Japanese word breaking
  • Elimination of stop words (e.g., "a", "the", etc.)
  • Looks for longest word match
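
A greedy longest-match breaker with stop-word removal can be sketched as follows. The dictionary, stop words, and function name are toy assumptions for illustration; a real morphological analyzer such as the one named above does considerably more.

```python
def break_words(text: str, dictionary: set[str], stop_words: frozenset[str] = frozenset()) -> list[str]:
    """Split text into words by always taking the longest dictionary match."""
    max_len = max(map(len, dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character
        # when nothing in the dictionary matches at this position.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                break
        i += size
        if candidate not in stop_words:
            words.append(candidate)
    return words


dictionary = {"東京", "東京都", "に", "住む"}
stop_words = frozenset({"に"})
print(break_words("東京都に住む", dictionary, stop_words))  # ['東京都', '住む']
```

Note that the breaker prefers "東京都" over the shorter "東京", which is the longest-match rule in action.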

22
Selecting Unicode Representation (1): UCS2
characteristics
  • Depending on the task, either the UCS2 or UTF8
    representation of Unicode was used in different
    parts of the Lycos search
  • Characteristics of UCS2
  • Each coded character element is fixed width, 16
    bits
  • Data paths must all accommodate 16 bits
  • Text in UCS2 is easy to manipulate and analyze
    (from a programming viewpoint)

23
Selecting Unicode Representation (2): UTF8
characteristics
  • Characteristics of UTF8
  • Each coded character is composed of one to six
    octets (one octet = 8 bits)
  • Data paths need only be "8-bit clean"
  • None of the octets in a multi-byte character is
    null (i.e., has the value zero)
  • Text in UTF8 is difficult to manipulate or
    analyze.
  • "8-bit clean": computer code which treats all 8
    bits of a byte as significant. True of any
    computer code that processes European languages
    properly, but not necessarily true of code that
    processes only ASCII, which uses only 7 bits per
    character.
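
The size and null-byte properties of the two representations can be checked directly. One assumption in this sketch: Python has no codec named UCS2, so `utf-16-be` stands in for it, which is byte-for-byte identical for the Basic Multilingual Plane characters used here. The one-to-six octet range reflects the UTF-8 definition current when these slides were written; the later RFC 3629 limits UTF-8 to four octets.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Return the number of bytes text occupies as UCS2 and as UTF8."""
    return {
        "ucs2": len(text.encode("utf-16-be")),  # fixed 16 bits per character
        "utf8": len(text.encode("utf-8")),      # variable width
    }


for sample in ("A", "é", "日"):
    print(sample, encoded_sizes(sample))

# UCS2 puts a zero byte in front of every ASCII character, which breaks
# code that treats zero as a string terminator. UTF8 never does this.
print(b"\x00" in "A".encode("utf-16-be"))   # True
print(b"\x00" in "Aé日".encode("utf-8"))     # False
```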

24
UCS2, UTF8, ASCII, etc.
  • [Diagram: an 8-bit clean data pipe. 16-bit UCS2
    can't fit; UTF8 can. Shown as UTF8 and as UCS2:
    an ASCII character (7 bits), a Latin character
    with diacritical (8 bits), and a Japanese
    character (double-byte in Shift-JIS, EUC-JP,
    etc.)]
25
Unicode in the Lycos System
  • UCS2: Japanese Morphological Analyzer from Basis
    Technology
  • Using UCS2 is the quick and economical way to
    process huge volumes of Japanese text.
  • UTF8: Lycos Catalog
  • Economy of disk space: ASCII is smaller in
    UTF8. On the Web: ASCII 79%, double-byte Asian
    less than 5%, European encodings and others 16%
  • Ease of integration with existing code (a.k.a.
    transmissibility)
  • Web percentages are based on the number of Web
    hosts on the Internet by country (total number of
    hosts for English-speaking domains as a
    percentage of the total number of hosts
    worldwide). Source: Survey by Network Wizards,
    http://www.nw.com
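
The disk-space argument can be made concrete with rough arithmetic. The sketch below rests on assumptions of mine, not on anything in the slides: it treats the host percentages above as if they were character percentages, takes "less than 5" as 5, and charges every European character 2 bytes and every Asian character 3 bytes in UTF8, which overstates the European cost because most such text is still ASCII letters.

```python
# Share of Web text by script, in percent (figures from the slide).
WEB_MIX_PERCENT = {"ascii": 79, "european": 16, "asian": 5}

# Assumed bytes per character in each representation.
UTF8_BYTES = {"ascii": 1, "european": 2, "asian": 3}
UCS2_BYTES = {"ascii": 2, "european": 2, "asian": 2}


def average_bytes_per_char(bytes_per_char: dict[str, int]) -> float:
    """Weight the per-script cost by each script's share of the Web."""
    total = sum(WEB_MIX_PERCENT[k] * bytes_per_char[k] for k in WEB_MIX_PERCENT)
    return total / 100


print(average_bytes_per_char(UTF8_BYTES))  # 1.26
print(average_bytes_per_char(UCS2_BYTES))  # 2.0
```

Under these assumptions a UTF8 catalog needs roughly 1.26 bytes per character against 2 for UCS2, even though Japanese text on its own is larger in UTF8.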

26
(No Transcript)
27
(No Transcript)
28
(No Transcript)
29
(No Transcript)
30
Project Complete: Lycos Japan (1)
  • Quick: Prototype of Japanese search is produced
    in two months. Lycos Japan: http://www.lycos.co.jp
  • Beta version of Japanese search debuts July 1998;
    enters competitive Japanese search engine race in
    4th place
  • Upon formal launch, grabs 2nd place in October
    1998 (according to Search Desk,
    http://www.searchdesk.com)

31
(No Transcript)
32
Project Complete: Lycos Japan (2)
  • E-conomical: Today, Lycos has spider, catalog and
    query software, which may easily be set to make
    catalogs in different languages by swapping in
    and out localized pieces
  • Settings for target domains
  • Encoding detection and conversion calls
  • Language-specific word breaker (if needed)
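
One way to picture "swapping in and out localized pieces" is a per-language settings object holding the three items listed. Everything in this sketch is hypothetical (class, field, and function names, the domain values, and the placeholder word breakers); it only illustrates the shape of such a design, not the actual Lycos software.

```python
from dataclasses import dataclass
from typing import Callable


def split_on_spaces(text: str) -> list[str]:
    """Word breaker for languages that delimit words with spaces."""
    return text.split()


def split_characters(text: str) -> list[str]:
    """Placeholder for a real Japanese word breaker: one unit per character."""
    return list(text)


@dataclass(frozen=True)
class LanguageSettings:
    target_domains: tuple[str, ...]          # which domains the spider covers
    encodings: tuple[str, ...]               # encodings to try, in order
    break_words: Callable[[str], list[str]]  # language-specific word breaker


ENGLISH = LanguageSettings((".com",), ("ascii",), split_on_spaces)
JAPANESE = LanguageSettings((".jp",), ("iso2022_jp", "euc_jp", "shift_jis"), split_characters)


def index_terms(raw: bytes, settings: LanguageSettings) -> list[str]:
    """Convert a page to Unicode with the first encoding that fits, then break it."""
    for encoding in settings.encodings:
        try:
            text = raw.decode(encoding)
            break
        except UnicodeDecodeError:
            continue
    else:
        raise ValueError("no configured encoding matched")
    return settings.break_words(text)


print(index_terms(b"lycos search", ENGLISH))
print(index_terms("日本".encode("euc_jp"), JAPANESE))
```

The shared `index_terms` code never changes; adding a language means supplying a new `LanguageSettings` value.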

33
Q&A

34
Q&A
Questions?
tina@basistech.com, www.basistech.com
jvanderclute@lycos.com, www.lycos.com
35
Q&A
Questions?
tina@basistech.com, www.basistech.com
jvanderclute@lycos.com, www.lycos.com
Thank you!