Title: Web crawler
 1 Web crawler
  2Member Group
- 1. Thanida Limsirivallop 47541164
- 2. Lucksamon Sivapattarakumpon 47541404
- 3. Patrapee Suwannan 47542212
- 4. Rataruch Tongpradith 47542246
- Website http://pirun.ku.ac.th/b4754116
3WebCrawler Definition
-  A crawler is an automatic program (sometimes
 called a "robot") which explores the World Wide
 Web, following links and searching for
 information or building a database. Such programs
 are often used to build automated indexes for the
 Web, allowing users to do keyword searches for
 Web documents.
-  Web crawlers are programs that exploit the graph 
 structure of the Web to move from page to page.
4Crawling The Web
-  Beginning: a key motivation for designing
 Web crawlers has been to retrieve Web pages and
 add them or their representations to a local
 repository.
-  Simplest form: a crawler starts from a seed
 page and then uses the external links within it
 to visit other pages. The process repeats
 with the new pages offering more external links
 to follow, until a sufficient number of pages are
 identified or some higher-level objective is
 reached.
-  General-purpose search engines, serving as entry
 points to Web pages, strive for coverage that is
 as broad as possible. They use Web crawlers to
 maintain their index databases, amortizing the
 cost of crawling and indexing over the millions
 of queries they receive.
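The simplest form described above can be sketched in a few lines. This is only an illustration under stated assumptions: the Web is replaced by a small in-memory dictionary (`TOY_WEB`, hypothetical), the stopping rule is a page budget (`max_pages`), and pages are visited breadth-first, which the slide does not prescribe.

```python
from collections import deque

# Hypothetical in-memory "Web": page URL -> list of outgoing links.
TOY_WEB = {
    "seed": ["a", "b"],
    "a": ["b", "c"],
    "b": ["seed", "d"],
    "c": [],
    "d": ["e"],
    "e": [],
}

def crawl(seed, max_pages, fetch_links=TOY_WEB.get):
    """Visit pages breadth-first from `seed` until `max_pages` are collected."""
    frontier = deque([seed])   # URLs waiting to be visited
    seen = {seed}              # URLs already queued, so none is visited twice
    repository = []            # stands in for the local repository
    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        links = fetch_links(url) or []   # unknown pages yield no links
        repository.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return repository

print(crawl("seed", max_pages=4))  # ['seed', 'a', 'b', 'c']
```

A real crawler would replace `fetch_links` with a function that downloads the page and extracts its links.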
5Crawling Infrastructure
 Basic Sequence Crawler 
 6Web Crawler Requirement
- The goal of the proposed crawler is to re-create 
 the look and feel of a website as it existed on
 the crawl date.
- The tool should be extensible to adapt to future 
 changes in web standards.
- General Requirements 
- Comprehensive downloading (Saving): the look and
 feel of the page is mirrored exactly, down to
 every image, link, and dynamic element.
- Scope of the crawl, difficult links, and depth.
- An intuitive and extensible interface (Interface):
 the crawler should use an intuitive graphical
 user interface.
- Command line interface, caching pages, and rights.
- Security of the server and browser (Niceness):
 the robots exclusion protocol must be obeyed.
 This requires downloading the robots.txt file
 before crawling the rest of the website.
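The niceness requirement can be illustrated with Python's standard `urllib.robotparser` module. This is a minimal sketch: the robots.txt content, the user-agent name `ExampleCrawler`, and the `example.com` URLs are assumptions made for the example, and the rules are parsed from a string instead of being downloaded.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from
# the site's /robots.txt before requesting any other page.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def build_parser(robots_text):
    """Parse robots.txt text into a rule checker."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)
print(parser.can_fetch("ExampleCrawler", "http://example.com/index.html"))      # True
print(parser.can_fetch("ExampleCrawler", "http://example.com/private/a.html"))  # False
```

The crawler would call `can_fetch` before every request and skip any URL for which it returns False.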
7Web Crawler Requirement
- Presentation of dynamic elements (Dynamic page
 image): the crawler must download
 Shockwave-Flash files and other content listed in
 EMBED tags. Care must be taken not to load the
 same file multiple times.
- Accurate look and feel (Presentation): the most
 important aspect of archiving web sites is that
 all links are accurate. Almost always,
 re-crawling an archive is needed to rewrite all
 links to be internal.
8Dominos: A New Web Crawler's Design
-  This paper describes the design and
 implementation of a real-time distributed Web
 crawling system running on a cluster of machines,
 and introduces a high-availability crawling
 system called Dominos.
-  Dominos is a dynamic system, which accounts for
 its highly flexible deployment, maintainability
 and enhanced fault tolerance. Finally, the
 paper discusses the experimental results obtained,
 comparing them with other documented systems.
9An investigation of web crawler behavior:
characterization and metrics
-  This paper presents a characterization study of
 search-engine crawlers. The study uses
 Web-server access logs from five
 academic sites in three different countries.
-  The results and observations provide
 useful insights into crawler behavior and serve
 as a basis for the authors' ongoing work on the
 automatic detection of Web crawlers.
10Crawler-Friendly Web Servers
-  This paper studies how to make web servers more
 crawler friendly and evaluates simple and
 easy-to-incorporate modifications to web servers
 that yield significant bandwidth savings.
-  This paper proposes that web servers export
 meta-data describing their pages so that crawlers
 can efficiently create and maintain large, fresh
 repositories.
11The Evolution of the Web and Implications for an
Incremental Crawler
-  This paper studies how to build an effective
 incremental crawler. The crawler selectively and
 incrementally updates its index and/or local
 collection of web pages, instead of periodically
 refreshing the collection in batch mode.
-  Based on the results, the paper discusses various
 design choices for a crawler and the possible
 trade-offs, and then proposes an architecture for
 an incremental crawler that combines the best
 strategies identified.
12SharpSpider: Spidering the Web through Web
Services
-  This paper presents SharpSpider, a
 distributed C# spider designed to address the
 issues of scalability, decentralisation and
 continuity of a Web crawl.
-  Fundamental to the design of SharpSpider is the 
 publication of an API for use by other services
 on the network. Such an API grants access to a
 constantly refreshed index built after successive
 crawls of the Web.
13The Anatomy of a Large-Scale Hypertextual Web
Search Engine
-  This paper presents Google, a prototype of a
 large-scale search engine which makes heavy use
 of the structure present in hypertext. It
 provides an in-depth description of a large-scale
 web search engine.
-  Google is designed to be a scalable search
 engine. The primary goal is to provide
 high-quality search results over a rapidly growing
 World Wide Web. Furthermore, Google is a complete
 architecture for gathering web pages, indexing
 them, and performing search queries over them.
14Incremental Web Search: Tracking Changes in the
Web
-  This paper presents algorithms and
 techniques useful for three problems: detecting
 web pages, extracting web pages, and evaluating
 web changes.
-  It also presents an application built on these
 techniques and algorithms, the Web Daily
 News Assistant (WebDNA), currently deployed on
 the NYU web site.
-  It models the change of web documents using
 survival analysis. Modeling web changes is useful
 for web crawler scheduling and web caching.
15An Investigation of Documents from the World Wide 
Web
-  This paper reports on an examination of pages from
 the WWW, analyzing data collected by the
 Inktomi Web Crawler. It covers many kinds of
 HTML analysis, such as evolution, improving Web
 content, control of HTML, sociological insights,
 user studies, content analyses, and structure
 analysis. There are also many tools for performing
 the data collection.
16A Crawler-based Study of Spyware on the Web
- The approach: crawl the Web, download content from a
 large number of sites, and then analyze it to
 determine whether it is malicious. In this way,
 several important questions can be answered. For
 example:
- How much spyware is on the Internet?
- Where is that spyware located (e.g., game sites,
 children's sites, adult sites, etc.)?
- How likely is a user to encounter spyware
 through random browsing?
- What kinds of threats does that spyware pose?
- What fraction of executables on the Internet are
 infected with spyware?
17Estimating Frequency of Change
- Estimating the change frequency of data to
 improve Web crawlers and Web caches and to help
 data mining, by developing several frequency
 estimators and identifying various scenarios.
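To make the idea of a frequency estimator concrete, here is a small sketch. It is not taken from the paper: it assumes that the crawler visits a page at a fixed interval, that each visit only reveals whether the page changed since the previous visit, and that changes follow a Poisson process. The function names are hypothetical.

```python
import math

def naive_rate(visits, changes_detected, interval_days):
    """Detected changes per day.

    Undercounts when a page changes more than once between two visits,
    because each visit can detect at most one change.
    """
    return changes_detected / (visits * interval_days)

def poisson_rate(visits, changes_detected, interval_days):
    """Change rate per day under the assumed Poisson model.

    P(no change seen in one interval) = exp(-rate * interval), so the rate
    is recovered from the observed fraction of unchanged visits.
    """
    unchanged = visits - changes_detected
    if unchanged == 0:
        return math.inf  # every visit saw a change; the rate cannot be bounded
    return -math.log(unchanged / visits) / interval_days

# Ten daily visits, five of which detected a change.
print(naive_rate(10, 5, 1))    # 0.5
print(poisson_rate(10, 5, 1))  # ln 2, about 0.693
```

The second estimate is higher because it accounts for changes that were overwritten before the crawler returned.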
18Collaborative Web Crawler over High-speed 
Research Network
- A distributed web crawler that utilizes the existing
 research networks.
- Distributed web crawling is a distributed 
 computing technique whereby Internet search
 engines employ many computers to index the
 Internet via web crawling. The idea is to spread
 out the required resources of computation and
 bandwidth to many computers and networks.
19Crawling-based Classification
-  The categorization of a database is determined 
 by its distribution of documents across
 categories.
20Mercator: A Scalable, Extensible Web Crawler
- The design features a crawler core for handling the
 main crawling tasks, and extensibility through
 protocol and processing modules.
- Users may supply new modules for performing
 customized crawling tasks.
- The authors have used Mercator for a variety of
 purposes, including performing random walks on the
 web, crawling their corporate intranet, and
 collecting statistics about the web at large.
21Parallel Crawlers
- A parallel crawler is a crawler that runs multiple
 processes in parallel. The goal is to maximize
 the download rate while minimizing the overhead
 from parallelization and to avoid repeated
 downloads of the same page.
- To avoid downloading the same page more than 
 once, the crawling system requires a policy for
 assigning the new URLs discovered during the
 crawling process, as the same URL can be found by
 two different crawling processes.
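One possible assignment policy, shown only as an illustration, is to hash the host name of each URL so that every URL of a site belongs to exactly one crawling process. The slide does not say which policy to use; the function name and the choice of SHA-1 are assumptions.

```python
import hashlib
from urllib.parse import urlsplit

def assign_process(url, num_processes):
    """Map a URL to the index of the crawling process responsible for it."""
    host = urlsplit(url).hostname or ""
    # A stable hash is used instead of the built-in hash(), which is
    # randomized per interpreter run and would differ between processes.
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_processes

owner = assign_process("http://example.com/a.html", 4)
print(owner == assign_process("http://example.com/b.html", 4))  # True
```

A process that discovers a URL owned by another process forwards it instead of downloading it, so the same page is never fetched twice.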
22Efficient Crawling Through URL Ordering
- The paper defines several different kinds of importance
 metrics and builds three models to evaluate
 crawlers.
- It then evaluates several combinations of importance
 and ordering metrics, using the Stanford Web
 pages.
23Efficient URL Caching for World Wide Web 
Crawling
- URL caching is very effective.
- Any web crawler must maintain a collection of
 URLs that are to be downloaded. Moreover, it
 would be unacceptable to download the same URL
 over and over, so the crawler needs a way to avoid
 adding a URL to the collection more than once.
- The authors recommend a cache size of between 100
 and 500 entries per crawling thread.
- The size of the cache needed to achieve top
 performance depends on the number of threads.
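A per-thread URL cache can be sketched as follows. The least-recently-used eviction policy is an assumption made for this example, since the slide does not name a policy, and `UrlCache` is a hypothetical class.

```python
from collections import OrderedDict

class UrlCache:
    """Small least-recently-used cache of URLs already seen by one thread."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()

    def seen(self, url):
        """Return True on a cache hit; otherwise record the URL and return False."""
        if url in self._entries:
            self._entries.move_to_end(url)     # mark as most recently used
            return True
        self._entries[url] = None
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict the least recently used
        return False

cache = UrlCache(capacity=2)
print(cache.seen("http://example.com/"))  # False, first time
print(cache.seen("http://example.com/"))  # True, cache hit
```

A miss does not prove that the URL is new, because it may have been evicted. On a miss the crawler still has to consult its full collection of known URLs; the cache only saves that lookup on a hit.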
24Multicasting a Web Repository
- The paper proposes an alternative to multiple crawlers: a
 single central crawler builds a database of Web
 pages and provides a multicast service for
 clients that need a subset of this Web image.
25Distributed High-performance Web Crawlers
- The workload is distributed across multiple
 machines by dividing and/or duplicating the pieces
 of work in the cluster.
- The program runs simultaneously on two or
 more computers that communicate with each
 other over a network.
26Parallel Crawling for Online Social Networks
- A centralized queue, implemented as a database
 table, is used to coordinate the
 operation of all the crawlers and prevent
 redundant crawling.
- This offers two tiers of parallelism, allowing
 multiple crawlers to run on each of the
 multiple agents, where a crawler is not
 affected by the failure of any other
 crawler.
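A queue held in a database table might look like the sketch below. It uses Python's built-in `sqlite3` module with an in-memory database and a single connection; the table layout and function names are assumptions, and the paper's actual schema is not given on the slide.

```python
import sqlite3

def open_queue(path=":memory:"):
    """Open the database and create the queue table if it is missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS queue ("
        " url TEXT PRIMARY KEY,"
        " status TEXT NOT NULL DEFAULT 'pending')"
    )
    return conn

def add_url(conn, url):
    """Queue a URL; the primary key silently rejects duplicates."""
    with conn:  # commits on success
        conn.execute("INSERT OR IGNORE INTO queue (url) VALUES (?)", (url,))

def claim_next(conn):
    """Mark the oldest pending URL as claimed and return it, or None if none is left."""
    with conn:
        row = conn.execute(
            "SELECT url FROM queue WHERE status = 'pending' ORDER BY rowid LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute("UPDATE queue SET status = 'claimed' WHERE url = ?", (row[0],))
    return row[0]

conn = open_queue()
for url in ["http://example.com/u1", "http://example.com/u2", "http://example.com/u1"]:
    add_url(conn, url)
print(claim_next(conn))  # http://example.com/u1
```

With several agents sharing one database server, the select and the update would have to run as one atomic step so that two crawlers cannot claim the same URL. This single-connection sketch does not show that part.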
27Finding replicated web collections
- The paper improves web crawling by avoiding redundant
 crawling in the Google system and proposes a new
 algorithm for efficiently identifying similar
 collections that form what the authors call a similar
 cluster.
28Learnable Web Crawler
- This section briefly explains one
 characteristic of the web crawler: its ability to
 learn. Knowledge bases are built from
 previous crawls. These knowledge bases are:
-  Seed URLs 
-  Topic Keywords 
-  URL Prediction 
29The Algorithm No KB
 Crawling_with_no_KB (topic) {
   seed_urls ← Search (topic, t);
   keywords ← topic;
   foreach url (seed_urls) {
     url_topic ← url.title + url.description;
     url_score ← sim (keywords, url_topic);
     enqueue (url_queue, url, url_score);
   }
   while (#url (url_queue) > 0) {
     url ← dequeue_url_with_max_score (url_queue);
     page ← fetch_new_document (url);
     page_score ← sim (keywords, page);
     foreach link (extract_urls (page)) {
       link_score ← α · sim (keywords, link.anchortext)
                    + (1 - α) · page_score;
       enqueue (url_queue, link, link_score);
     }
   }
 }
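The no-KB algorithm above can be tried out with a runnable sketch. Several things here are assumptions and not part of the slides: the Web is a small in-memory dictionary (`TOY_WEB`), `sim` is a simple keyword-overlap score, the seed URLs are passed in directly instead of coming from a search, and a `seen` set is added so that the loop terminates.

```python
import heapq

# Hypothetical toy Web: url -> (page text, [(link url, anchor text), ...]).
TOY_WEB = {
    "u1": ("web crawler basics",
           [("u3", "cooking recipes"), ("u2", "crawler design")]),
    "u2": ("crawler design notes", []),
    "u3": ("pasta recipes", []),
}

def sim(keywords, text):
    """Assumed similarity: fraction of the keywords that occur in the text."""
    words = set(text.lower().split())
    return sum(1 for k in keywords if k in words) / len(keywords)

def crawl_no_kb(topic, seeds, alpha=0.5, web=TOY_WEB):
    """Best-first crawl: always fetch the queued URL with the highest score."""
    keywords = topic.lower().split()
    queue, seen, order = [], set(seeds), []
    for url in seeds:
        # The slides score a seed by its title and description; the toy
        # Web has neither, so the page text stands in for them here.
        heapq.heappush(queue, (-sim(keywords, web[url][0]), url))
    while queue:
        _, url = heapq.heappop(queue)   # heapq is a min-heap, so scores are negated
        text, links = web[url]          # plays the role of fetch_new_document
        order.append(url)
        page_score = sim(keywords, text)
        for link, anchor in links:
            if link in seen or link not in web:
                continue
            seen.add(link)
            link_score = alpha * sim(keywords, anchor) + (1 - alpha) * page_score
            heapq.heappush(queue, (-link_score, link))
    return order

print(crawl_no_kb("web crawler", ["u1"]))  # ['u1', 'u2', 'u3']
```

Although `u3` is listed first on page `u1`, `u2` is fetched first because its anchor text matches the topic and therefore earns a higher link score.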
30The Algorithm With KB
 Crawling_with_KB (KB, topic) {
   seed_urls ← get_seed_url (KB, topic, t);
   keywords ← get_topic_keyword (KB, topic);
   foreach url (seed_urls) {
     url_score ← get_pred_score (KB, topic, url);
     enqueue (url_queue, url, url_score);
   }
   while (#url (url_queue) > 0) {
     url ← dequeue_url_with_max_score (url_queue);
     page ← fetch_new_document (url);
     page_score ← sim (keywords, page);
     foreach link (extract_urls (page)) {
       pred_link_score ← get_pred_score (KB, topic, link);
       link_score ← α · (β · sim (keywords, link.anchortext)
                         + (1 - β) · pred_link_score)
                    + (1 - α) · page_score;
       enqueue (url_queue, link, link_score);
     }
   }
 }
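The link score used with a knowledge base can be written as a small function and checked by hand. The function name and the sample values are assumptions; the two weights correspond to the two mixing parameters in the algorithm above, named `alpha` and `beta` here.

```python
def link_score_with_kb(anchor_sim, pred_link_score, page_score, alpha, beta):
    """Blend anchor-text similarity, the KB prediction, and the parent page score."""
    # Inner blend: how promising the link itself looks.
    link_part = beta * anchor_sim + (1 - beta) * pred_link_score
    # Outer blend: weigh the link against the page it was found on.
    return alpha * link_part + (1 - alpha) * page_score

# Anchor similarity 0.5, KB prediction 1.0, parent page score 0.0:
# link_part = 0.5 * 0.5 + 0.5 * 1.0 = 0.75, score = 0.5 * 0.75 + 0.5 * 0.0
print(link_score_with_kb(0.5, 1.0, 0.0, alpha=0.5, beta=0.5))  # 0.375
```

With `beta` set to 1 the KB prediction drops out and the score reduces to the formula of the no-KB algorithm.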
31Overall Process
 Learnable_Crawling (topic)
 {
   if (no KB) {
     Collection ← Crawling_with_no_KB (topic);
   }
   else {
     Collection ← Crawling_with_KB (KB, topic);
   }
   /* Learn from the previous crawl */
   KB.seed_urls ← learn_seed_URL (Collection);
   KB.keywords ← learn_topic_keyword (Collection);
   KB.url_predict ← learn_URL_prediction (Collection);
 }
32Learning Analysis 
 34  THANK YOU FOR YOUR ATTENTION .