Title: Vo Pham Tra My
Search Engine for the Vietnamese Language
Contents
- Introduction
- What is expected?
- Solution for general requirements
- Search engine's architecture for general requirements
- Solution for the language problem
- Search engine's complete architecture
- Conclusion
- References
Introduction (1/2)
- The Internet is the richest source of information
- A search engine is the most useful gateway to data
- Powerful and famous search engines: Google, Yahoo, AllTheWeb, AltaVista
  - Capability of multilingual searching
  - Limitations for countries whose
    - Languages are different from English
    - Markets are small
Introduction (2/2)
- Example
  - Search engine: Google
  - Language: Vietnamese
- Facts
  - Results are incomplete and not of high quality.
- This presentation aims at
  - Suggesting a general design for a search engine that is specific to the Vietnamese language
  - With reference to Google's architecture
  - With supplements for the specific characteristics of the language.
What is expected? (1/2)
- General requirements
  - Features
    - File types: text, HTML, PDF, image.
    - Cached links: snapshot of a page as it was when crawled
    - Similar pages: display pages that are related to a particular result
  - Quality
    - Most relevant pages on top
    - Up to date
  - Interface
    - Similar to Google: simple and easy for everyone
    - Homepage has a text box and a Search button
    - Result page shows 10 links at a time, in descending order
  - Commerce
    - Pay-per-placement: a customer buys keyword(s) to have their link at the top of the result list when those words are searched
    - Database sale: sell the database to other engines
    - Advertisement: display advertisement banners/links related to the keyword
What is expected? (2/2)
- Requirement with the language problem
  - Complete, high-quality result
Solution for general requirements (1/2)
- Features
  - File types: text, HTML, PDF, image.
    - For text documents: crawl the documents, read the files, create an index for the words in the documents
    - For images: crawl the file, use a specific tool to retrieve the contents, utilize anchor text, create an index
  - Cached links: snapshot of a page as it was when crawled
    - For each result, provide a cached link that leads the user to the corresponding page stored in the engine's repository.
  - Similar pages: display pages that are related to a particular result
    - Utilize the link graph of web pages
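The "create an index for the words in the documents" step can be pictured as an inverted index: a map from each word to the documents that contain it. The sketch below is only an illustration under assumptions: the names `tokenize` and `build_index` and the two sample documents are hypothetical, and a real indexer would also record positions and weights.

```python
import re
import unicodedata
from collections import defaultdict

def tokenize(text):
    """Normalize to NFC, lowercase, and split the text into word tokens."""
    text = unicodedata.normalize("NFC", text).lower()
    return re.findall(r"\w+", text)

def build_index(documents):
    """Map each word to the sorted list of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

# Hypothetical crawled documents, keyed by document id.
docs = {
    1: "Hà Nội là thủ đô của Việt Nam",
    2: "Việt Nam có nhiều món ăn ngon",
}
index = build_index(docs)

# Look up the documents that contain a given word.
key = tokenize("Việt")[0]
print(key, index[key])
```

Normalizing to NFC before tokenizing keeps an accented letter as a single code point, so the same word always yields the same index key.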
Solution for general requirements (2/2)
- Quality
  - Most relevant pages on top
    - Use 2 measures
      - IR score
        - shows how important a word is to a page.
        - calculated by the Indexer
      - PageRank
        - shows how important a page is in the global ranking.
        - utilizes the link structure of web pages
  - Up to date
    - Repository and index are updated once a week
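To make the PageRank measure concrete, here is a minimal power-iteration sketch over a tiny link graph. It is an illustration, not the engine's actual implementation: the damping factor 0.85, the iteration count, the sample graph, and the `final_score` weighting of IR score against PageRank are all assumptions, since the slides do not say how the two measures are combined.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps a page to the list of pages it links to; returns page -> rank."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on evenly to the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page (no out-links): spread its rank over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank

def final_score(ir_score, page_rank, weight=0.5):
    """Hypothetical linear mix of the two measures (an assumption)."""
    return weight * ir_score + (1.0 - weight) * page_rank

# Hypothetical link graph: a -> b, a -> c, b -> c, c -> a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page c is linked from both a and b, so it ends up ranked above b, which has a single in-link.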
Search engine's architecture
[Architecture diagram. Components shown: StoreServer, URLServer, Anchors, Repository, URLResolver, Indexer, Lexicon, Links, Doc Index, PageRank, Searcher]
Solution for language problem (1/6)
- Characteristics of the language
  - 11 vowels: a, ă, â, e, ê, o, ô, ơ, u, ư, i.
  - Consonants: similar to the English alphabet
    - Extra đ
    - No f, j, w, z
  - 5 accents
  - A charset must be used to correctly display Vietnamese characters
Solution for language problem (2/6)
- Why is the search result not complete?
  - No official charset is imposed for Vietnamese web pages
    - Some web pages don't even declare a charset
    - For those that do, there are various options: Unicode, VNI, TCVN, Telex, VIQR
  - Some web pages may not appear in the result list because they use a charset different from the one used in the database
- Existing search engines
  - Google doesn't pay attention to the charset problem
  - Vietnamese search engines do, but don't deal with another problem
Solution for language problem (3/6)
- Why is the search result not complete?
  - Anchor text is an important source of information
    - Anchor text: the text of a link
    - Ex: <html> /hotnews.html </html>
  - Most Vietnamese web pages have English anchor texts (short but informative)
    - The engine creates different indices for different words whose meanings are the same
Solution for language problem (4/6)
[Diagram: pages encoded in Unicode, TCVN, VNI, Telex, or VIQR -> Converter -> words with accents (Unicode) and words without accents. Components shown: URLServer, StoreServer, Repository]
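A rough sketch of what such a Converter could do for one input encoding, VIQR, followed by the accent-free form. The function names are hypothetical. The VIQR handling is deliberately simplified: it assumes the text is already known to be VIQR and ignores VIQR's backslash escapes, so a full stop right after a vowel would be misread as a tone mark. TCVN and VNI would need their own mapping tables, which are not shown.

```python
import unicodedata

# VIQR mark character -> Unicode combining character.
_MARKS = {
    "(": "\u0306",  # breve
    "^": "\u0302",  # circumflex
    "+": "\u031b",  # horn
    "'": "\u0301",  # acute
    "`": "\u0300",  # grave
    "?": "\u0309",  # hook above
    "~": "\u0303",  # tilde
    ".": "\u0323",  # dot below
}
_VOWELS = set("aeiouyAEIOUY")

def viqr_to_unicode(text):
    """Convert VIQR text to precomposed (NFC) Unicode. Simplified sketch."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        # "dd" / "DD" stands for the letter d with stroke.
        if ch in "dD" and i + 1 < len(text) and text[i + 1] in "dD":
            out.append("\u0111" if ch == "d" else "\u0110")
            i += 2
            continue
        out.append(ch)
        i += 1
        if ch in _VOWELS:
            # Marks that directly follow a vowel become combining characters.
            while i < len(text) and text[i] in _MARKS:
                out.append(_MARKS[text[i]])
                i += 1
    return unicodedata.normalize("NFC", "".join(out))

def strip_accents(text):
    """Return the text with all Vietnamese diacritics removed."""
    decomposed = unicodedata.normalize("NFD", text)
    plain = "".join(c for c in decomposed if not unicodedata.combining(c))
    # The d with stroke has no Unicode decomposition, so map it by hand.
    return plain.replace("\u0111", "d").replace("\u0110", "D")

with_accents = viqr_to_unicode("Vie^.t Nam")
without_accents = strip_accents(with_accents)
print(with_accents, "|", without_accents)
```

Storing both forms for each word lets a page be found whether or not the user types accents.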
Solution for language problem (5/6)
[Diagram: keywords with (or without) accents -> Converter -> Searcher]
- If the keywords have accents: search for the keyword with accents and the keyword without accents
- Else: search for the keyword without accents
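The rule in this slide can be sketched as a small query-expansion function. The name `expand_query` is hypothetical, and the sketch assumes the index stores every word both with and without accents, as in the previous slide.

```python
import unicodedata

def strip_accents(text):
    """Return the text with all Vietnamese diacritics removed."""
    decomposed = unicodedata.normalize("NFD", text)
    plain = "".join(c for c in decomposed if not unicodedata.combining(c))
    # The d with stroke has no Unicode decomposition, so map it by hand.
    return plain.replace("\u0111", "d").replace("\u0110", "D")

def expand_query(keyword):
    """Return the list of keyword forms to hand to the Searcher."""
    keyword = unicodedata.normalize("NFC", keyword)
    plain = strip_accents(keyword)
    if plain != keyword:
        # The keyword has accents: search both forms.
        return [keyword, plain]
    # No accents: search only the plain form.
    return [plain]

print(expand_query("Hà Nội"))
print(expand_query("Ha Noi"))
```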
Solution for language problem (6/6)
- Anchor text
  - Use a table to replace English words with the corresponding Vietnamese words
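A minimal sketch of the replacement table, assuming a plain word-for-word lookup. The table entries and the name `translate_anchor` are illustrative only; a real table would be much larger and would have to handle multi-word phrases.

```python
# Hypothetical English -> Vietnamese table for common anchor words.
ANCHOR_TABLE = {
    "news": "tin tức",
    "home": "trang chủ",
    "sports": "thể thao",
    "music": "âm nhạc",
}

def translate_anchor(anchor_text):
    """Replace known English words in an anchor text; keep unknown words as-is."""
    words = anchor_text.lower().split()
    return " ".join(ANCHOR_TABLE.get(word, word) for word in words)

print(translate_anchor("Sports News"))
```

With the replacement applied before indexing, an English anchor and a Vietnamese anchor with the same meaning land on the same index entry.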
Search engine's complete architecture
[Architecture diagram. Components shown: URLServer, StoreServer, Converter, Repository, Anchors, URLResolver, Indexer, Lexicon, Links, Doc Index, PageRank, Searcher]
Conclusion
- The suggested architecture satisfies
  - General requirements for a search engine
  - Specific requirements for the Vietnamese language
References
- http://www.google.com
- S. Brin, L. Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. WWW7 / Computer Networks 30(1-7): 107-117 (1998).
- Luiz André Barroso, Jeffrey Dean, Urs Hölzle. Web Search for a Planet: The Google Cluster Architecture.
- Arvind Arasu et al. Searching the Web. ACM Transactions on Internet Technology, 2001.
- http://www.infotoday.com/searcher