Vo Pham Tra My - PowerPoint PPT Presentation


Title: Vo Pham Tra My


1
Search Engine for Vietnamese language
  • Vo Pham Tra My
  • 20044318

2
Contents
  • Introduction
  • What is expected?
  • Solution for general requirements
  • Search engine architecture for general
    requirements
  • Solution for the language problem
  • Complete search engine architecture
  • Conclusion
  • References

3
Introduction (1/2)
  • The Internet is the richest source of information
  • Search engines are the most useful gateway to that data
  • Powerful and famous search engines: Google,
    Yahoo, AllTheWeb, AltaVista
  • Capability of multilingual searching
  • Limitations for countries whose
    • Languages are different from English
    • Markets are small

4
Introduction (2/2)
  • Example
    • Search engine: Google
    • Language: Vietnamese
  • Facts
    • Results are incomplete and not of high quality
  • This presentation aims at
    • Suggesting a general design for a search engine
      that is specific to the Vietnamese language
    • With reference to Google's architecture
    • With supplements for the specific characteristics
      of the language

5
What is expected? (1/2)
  • General requirements
    • Features
      • File types: text, HTML, PDF, image
      • Cached links: a snapshot of each page as it was
        crawled
      • Similar pages: display pages that are related to
        a particular result
    • Quality
      • Most relevant pages on top
      • Up-to-date results
    • Interface
      • Similar to Google: simple and easy for everyone
      • Homepage has a text box and a Search button
      • Result page shows 10 links at a time, in
        descending order of relevance
    • Commerce
      • Pay-per-placement: customers buy keyword(s) to
        have their link placed at the top of the result
        list when those words are searched
      • Database: sell the database to other engines
      • Advertisement: display advertisement
        banners/links related to the keyword

6
What is expected? (2/2)
  • Requirement for the language problem
    • Complete, high-quality results

7
Solution for general requirements (1/2)
  • Features
    • File types: text, HTML, PDF, image
      • For text documents: crawl the documents, read
        the files, and create an index of the words in
        the documents
      • For images: crawl the file, use a specialized
        tool to retrieve its contents, utilize anchor
        text, and create an index
    • Cached links: a snapshot of each page as it was
      crawled
      • For each result, provide a cached link that
        leads the user to the corresponding page stored
        in the engine's repository
    • Similar pages: display pages that are related to
      a particular result
      • Utilize the link graph of web pages
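
The indexing step for text documents described above can be sketched as a small inverted index. This is a minimal illustration, not the presentation's actual Indexer: the function names (tokenize, build_index, search) and the sample documents are assumptions made for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(documents):
    """Map each word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical crawled documents, keyed by an assumed document ID.
documents = {
    "doc1": "Search engine for Vietnamese",
    "doc2": "Vietnamese news",
}
index = build_index(documents)
print(search(index, "vietnamese news"))
```

A real engine would also store word positions and per-document weights; the sketch keeps only document membership.
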

8
Solution for general requirements (2/2)
  • Quality
    • Most relevant pages on top
      • Use 2 measures
        • IR score
          • Shows how important a word is to a page
          • Calculated by the Indexer
        • PageRank
          • Shows how important a page is in the global
            ranking
          • Utilizes the link structure of web pages
    • Up-to-date
      • The Repository and Index are updated once a week
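
The two measures above can be sketched as follows. The PageRank part is a plain power iteration over a link graph. The damping factor of 0.85, the iteration count, and the linear combination in final_score are assumptions made for illustration; the slides do not say how the IR score and PageRank are combined.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank. `links` maps a page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


def final_score(ir_score, page_rank, weight=0.5):
    """Assumed linear mix of the two measures (not specified in the slides)."""
    return weight * ir_score + (1.0 - weight) * page_rank


# Hypothetical three-page link graph.
ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
print(sorted(ranks.items()))
```

Results would then be sorted by final_score in descending order, which matches the requirement that the most relevant pages appear on top.
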

9
Search engine architecture
  • For general requirements

[Diagram: URLServer, StoreServer, Repository, Anchors, URLResolver, Indexer, Lexicon, Links, Doc Index, PageRank, Searcher]

10
Solution for language problem (1/6)
  • Characteristics of the language
    • 11 vowels: a, ă, â, e, ê, o, ô, ơ, u, ư, i
    • Consonants: similar to the English alphabet
      • Extra đ
      • No f, j, w, z
    • 5 accents (tone marks): acute, grave, hook above,
      tilde, dot below
  • A charset must be declared to display Vietnamese
    characters correctly

11
Solution for language problem (2/6)
  • Why are search results incomplete?
    • No official charset is imposed for Vietnamese web
      pages
    • Some web pages don't even declare a charset
    • Those that do choose among various options:
      Unicode, VNI, TCVN, Telex, VIQR
    • Some web pages may not appear in the result list
      because they use a charset different from the one
      used in the database
  • Existing search engines
    • Google does not address the charset problem
    • Vietnamese search engines do, but they don't deal
      with another problem
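
The charset mismatch described above can be illustrated with a short sketch. It assumes one page stored as Unicode (UTF-8) and another written in the VIQR convention, where accents are typed as ASCII markers; the sample strings are made up for the example.

```python
# The same phrase as stored by two hypothetical pages.
unicode_page = "Việt Nam".encode("utf-8")    # accents as Unicode characters
viqr_page = "Vie^.t Nam".encode("ascii")     # accents as VIQR ASCII markers

# A query typed in Unicode.
query = "Việt".encode("utf-8")

# A naive byte-level match finds only the page that uses the same charset.
print(query in unicode_page)
print(query in viqr_page)
```

The second page contains the same word, but it never matches until both sides are converted to one common representation.
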

12
Solution for language problem (3/6)
  • Why are search results incomplete?
    • Anchor text is an important source of information
      • Anchor text: the text of a link
      • Ex: <html> /hotnews.html </html>
    • Most Vietnamese web pages have English anchor texts
      (short but informative)
    • The engine therefore creates different index
      entries for different words whose meanings are the
      same

13
Solution for language problem (4/6)
  • Crawling

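[Diagram: crawled pages in Unicode, TCVN, VNI, Telex, or VIQR pass through the Converter, which produces words with accents (in Unicode) and words without accents. Other components shown: URLServer, StoreServer, Repository.]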
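
One branch of the Converter can be sketched as a VIQR-to-Unicode decoder. This is a simplified illustration under stated assumptions: it covers only VIQR (not TCVN, VNI, or Telex), the function name viqr_to_unicode is made up for the example, and it ignores VIQR's backslash escapes, so a sentence-ending period after a vowel or an English word containing "dd" would be converted by mistake.

```python
import unicodedata

# VIQR marker -> Unicode combining character.
VIQR_MARKS = {
    "'": "\u0301",  # acute
    "`": "\u0300",  # grave
    "?": "\u0309",  # hook above
    "~": "\u0303",  # tilde
    ".": "\u0323",  # dot below
    "^": "\u0302",  # circumflex
    "(": "\u0306",  # breve
    "+": "\u031b",  # horn
}
VOWELS = set("aeiouyAEIOUY")


def viqr_to_unicode(text):
    """Convert VIQR-encoded Vietnamese text to precomposed (NFC) Unicode."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        ch = text[i]
        if pair in ("dd", "DD", "Dd"):
            # VIQR writes the letter d-with-stroke as a doubled d.
            out.append("\u0111" if pair == "dd" else "\u0110")
            i += 2
            continue
        # A marker counts only right after a vowel or a previous marker.
        after_vowel = bool(out) and (
            out[-1] in VOWELS or unicodedata.combining(out[-1]) != 0
        )
        out.append(VIQR_MARKS[ch] if ch in VIQR_MARKS and after_vowel else ch)
        i += 1
    # NFC merges each base letter with its combining marks.
    return unicodedata.normalize("NFC", "".join(out))


print(viqr_to_unicode("Tie^'ng Vie^.t"))
```

Converting every page to one Unicode form at crawl time means the Repository and the index only ever hold a single representation of each word.
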
14
Solution for language problem (5/6)
  • Searching

[Diagram: keywords with (or without) accents pass through the Converter to the Searcher. If the keywords have accents, the Converter passes on the keyword both with accents and without accents; otherwise it passes on the keyword without accents.]
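
The search-time rule above can be sketched with Python's standard unicodedata module. The function names strip_accents and expand_query are assumptions made for the example; the letter đ is handled separately because it has no Unicode decomposition.

```python
import unicodedata


def strip_accents(text):
    """Remove Vietnamese diacritics, e.g. turn accented vowels into plain ones."""
    decomposed = unicodedata.normalize("NFD", text)
    plain = "".join(c for c in decomposed if not unicodedata.combining(c))
    # d-with-stroke does not decompose, so map it by hand.
    return plain.replace("\u0111", "d").replace("\u0110", "D")


def expand_query(keyword):
    """Apply the rule: accented keyword -> both forms, otherwise plain only."""
    plain = strip_accents(keyword)
    if plain != keyword:
        return [keyword, plain]
    return [plain]


print(expand_query("tiếng việt"))
print(expand_query("tieng viet"))
```

Searching both forms lets an accented query also reach pages that were written without accents.
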
15
Solution for language problem (6/6)
  • Anchor text
  • Use a table to replace English words with the
    corresponding Vietnamese words
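
The replacement table can be sketched as a simple dictionary lookup. The table entries and the function name translate_anchor are hypothetical examples, not taken from the presentation, and a word-by-word lookup like this ignores differences in word order between the two languages.

```python
# Hypothetical English -> Vietnamese table for common anchor words.
ANCHOR_TABLE = {
    "news": "tin tức",
    "home": "trang chủ",
    "sports": "thể thao",
}


def translate_anchor(anchor_text):
    """Replace each known English word with its Vietnamese equivalent."""
    words = anchor_text.lower().split()
    return " ".join(ANCHOR_TABLE.get(word, word) for word in words)


print(translate_anchor("News"))
```

Words missing from the table are left unchanged, so the anchor text is still indexed as written.
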

16
Complete search engine architecture

[Diagram: URLServer, StoreServer, Converter, Repository, Anchors, URLResolver, Indexer, Lexicon, Links, Doc Index, PageRank, Searcher]

17
Conclusion
  • The suggested architecture satisfies
    • General requirements for a search engine
    • Specific requirements for the Vietnamese language

18
References
  • http://www.google.com
  • S. Brin, L. Page: The Anatomy of a Large-Scale
    Hypertextual Web Search Engine. WWW7 /
    Computer Networks 30(1-7): 107-117 (1998)
  • Luiz André Barroso, Jeffrey Dean, Urs Hölzle: Web
    Search for a Planet: The Google Cluster
    Architecture.
  • Arvind Arasu et al.: Searching the Web. ACM
    Transactions on Internet Technology, 2001.
  • http://www.infotoday.com/searcher