Title: Implementation Issues of Distributed Crawlers
A Review of Implementation Issues in Distributed Crawling Architectures
by Ditesh Kumar, Loh Jin Tiam
Faculty of Computer Science and Information Technology
Universiti Malaya
- What are crawlers?
- Networked software systems that index certain classes of resources.
- Most crawlers focus on web content (e.g. HTML pages).
- They generally serve as a backend to search engines.
- Functionality of Crawlers
[Diagram: the crawl loop]
- Download Resource
- Store Resource
- Retrieve Pointer
- Perform Pointer Analysis
- Store New Pointers (then repeat)
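The loop above can be sketched in Python (a minimal illustration, not any particular crawler's implementation; the `fetch` callable and the naive href regex are assumptions):

```python
import re
from collections import deque

def crawl(seed_urls, fetch, max_pages=100):
    """Minimal crawl loop: download, store, extract pointers, enqueue new ones.

    `fetch` is any callable url -> HTML body (hypothetical; plug in a real
    HTTP client such as urllib in practice).
    """
    frontier = deque(seed_urls)          # pointers waiting to be crawled
    store = {}                           # url -> body ("store resource")
    seen = set(seed_urls)
    while frontier and len(store) < max_pages:
        url = frontier.popleft()
        try:
            body = fetch(url)            # "download resource"
        except OSError:
            continue                     # skip unreachable resources
        store[url] = body                # "store resource"
        # "retrieve pointer": naive href extraction; real crawlers parse HTML
        for link in re.findall(r'href="([^"]+)"', body):
            if link not in seen:         # "pointer analysis": keep new pointers only
                seen.add(link)
                frontier.append(link)    # "store new pointers"
    return store
```

In a distributed setting, the frontier and the "new pointers" step are exactly the parts that get partitioned between crawlers, as the later slides discuss.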
- Issues
- Architectures
- Coverage, overlap and communication overhead
- Assignment of Responsibility
- Scalability
- URL Partitioning
- Distributed Architectures
- One server, multiple crawlers
[Diagram: one central Server connected to multiple crawlers]
- Architectures (many crawlers-one server)
- Advantages
- Well known architecture, easy to code.
- Easy to partition responsibility.
- Easy to perform fault-tolerance (e.g., if a crawler dies, it can be restarted).
- Scalability is not an issue.
- Process and performance monitoring become easier.
- Less network traffic.
- Architectures (many crawlers-one server)
- Disadvantages
- Potential bottleneck.
- Not practical when number of crawlers is large.
- Architectures
- Peer-to-peer
[Diagram: crawlers connected to one another peer-to-peer]
- Architectures (p2p)
- Advantages
- Scalable if coded properly.
- Automatic partitioning of responsibility.
- No central point of failure.
- Fault-tolerant (e.g., if a crawler dies, it can be replaced by another crawler).
- Architectures (p2p)
- Disadvantages
- Hard to code properly.
- Network traffic relatively higher.
- Harder to collate information collected.
- Harder to monitor performance.
- Balancing and Contravariance (URL Partitioning)
- First Rule: Given a set of resources to crawl, the highest efficiency is attained if no two crawlers ever crawl the same subset (balancing).
[Diagram: the URL space (a.com, b.com, c.com, d.com) partitioned among Crawlers A-D]
- Balancing and Contravariance (ctd.)
- Second Rule: If the total resource size changes, then the subset to be crawled by each crawler must change accordingly (contravariance).
[Diagram: the partitions redrawn across Crawlers A-H after the resource set changes]
- Balancing and Contravariance (ctd.)
- Four issues: first, ensure all crawlers get roughly an equal number of crawling assignments.
- Second, ensure that the assignments never clash.
- Third, it must be possible for any crawler to know who is responsible for crawling a particular resource.
- Balancing and Contravariance (ctd.)
- Fourth, each crawler must be able to adapt to new
crawling assignments (to ensure contravariance).
- Balancing and Contravariance (ctd.)
- A good technique is to hash the resource name and partition the hashes between the crawlers.
- Papers suggest using domain names as the hash source.
- A better way is to use the IP address (benefit: crawling can be separated geographically; disadvantage: it requires a DNS lookup for every host).
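A domain-name-based hash partition of the sort described can be sketched as follows (a toy illustration; the function name is made up):

```python
import hashlib

def responsible_crawler(url, n_crawlers):
    """Hash a URL's host name and map it onto one of n_crawlers.

    Hashing the whole host (rather than the full URL) keeps every page of a
    site on one crawler, which simplifies per-site politeness. Note that
    changing n_crawlers reshuffles almost every assignment -- exactly the
    contravariance problem; schemes such as consistent hashing reduce that
    churn.
    """
    host = url.split("/")[2]                      # crude host extraction
    digest = hashlib.md5(host.encode()).digest()
    return int.from_bytes(digest, "big") % n_crawlers
```

Every crawler can evaluate this function locally, which satisfies the third requirement above (any crawler can tell who is responsible for a resource) without any network traffic.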
- Coverage, overlap, communication overhead
- Coverage is U/I
- Overlap is N/U
- Communication overhead is E/N
- (Here I is the number of pages that ought to be crawled, U the number of unique pages actually downloaded, N the total number of downloads including duplicates, and E the number of URLs exchanged between crawlers.)
- Goal
- Maximize coverage
- Minimize overlap
- Minimize communication overhead
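Reading the ratios with I as the pages that ought to be crawled, U as the unique pages actually downloaded, N as total downloads (duplicates included), and E as the URLs exchanged between crawlers, the three metrics can be computed directly (a sketch under that reading of the symbols):

```python
def crawl_metrics(downloads, target, exchanged):
    """downloads: every fetch performed by every crawler (duplicates possible);
    target: the set of pages the swarm ought to crawl;
    exchanged: number of URLs shipped between crawlers.
    """
    N = len(downloads)                  # total downloads
    U = len(set(downloads))             # unique pages obtained
    I = len(target)                     # pages we should have obtained
    return {
        "coverage": U / I,              # goal: maximize (1.0 = everything reached)
        "overlap": N / U,               # goal: minimize (1.0 = no duplicated work)
        "comm_overhead": exchanged / N, # goal: minimize (URLs exchanged per download)
    }
```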
- Assignment of Responsibility
- Firewall: ignore resource links outside the domain of the crawling agent (CA). Zero communication overhead and overlap, but lower coverage.
- Cross-over: retrieve resource links outside the domain of the CA. Zero communication overhead and greater coverage, but a high overlap ratio.
- Exchange: send resource links to the relevant CA. Maximized coverage and reduced overlap, but high communication overhead.
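The three modes differ only in how a discovered foreign link is handled, which can be made concrete in a few lines (a sketch; `mine` is an assumed ownership predicate, e.g. a hash partition function):

```python
def route_link(link, mine, mode, frontier, outbox):
    """Dispatch a discovered link under the firewall / cross-over / exchange modes.

    mine(link) -> bool is an assumed ownership test for this crawling agent.
    frontier collects links this agent will crawl itself; outbox collects
    links to be shipped to the responsible agent.
    """
    if mine(link):
        frontier.append(link)   # links in our own domain are always crawled
    elif mode == "firewall":
        pass                    # drop it: no overhead or overlap, lost coverage
    elif mode == "crossover":
        frontier.append(link)   # crawl it anyway: coverage up, overlap up
    elif mode == "exchange":
        outbox.append(link)     # ship it: coverage up, communication overhead up
```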
- Assignment of Responsibility
- Best solution: a combination of the cross-over and exchange models.
- The cross-over model is used when the communication overhead far outweighs the actual retrieval cost.
- The exchange model is used when the communication overhead is just a fraction of the actual retrieval cost.
- Intuitively, we expect the exchange model to be used most of the time.
- Scalability
- Defined as: the number of pages crawled per second per crawler should be (almost) independent of the number of crawlers.
- Throughput should grow linearly with the number of agents (making performance easy to measure).
- Load must be handled transparently by cleanly adding new agents to the crawling swarm.
- Scalability
- Crawling agents can have multiple threads: Google uses 300 threads/agent, Mercator uses 100 threads/agent, and Polytechnic uses 1000 threads/agent.
- One issue is OS thread support vs fork()-ing support.
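A multithreaded agent in the Mercator style (many blocking fetches spread over a pool of OS threads) can be sketched as follows; the `fetch` callable is an assumption standing in for a real HTTP client:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(urls, fetch, n_threads=100):
    """Run blocking downloads concurrently on a pool of OS threads.

    100 workers mirrors the Mercator figure; raising n_threads toward
    300-1000 approaches the Google / Polytechnic configurations, subject
    to what the OS thread implementation can sustain.
    """
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        bodies = pool.map(fetch, urls)   # one thread per in-flight fetch
        return dict(zip(urls, bodies))
```

On platforms with weak thread support, the same fan-out can instead be achieved by fork()-ing worker processes, at the cost of heavier per-worker state.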
- Conclusion
- Distributed crawlers are harder to design, code, and monitor.
- Yet, the distributed crawler architecture has significant benefits that justify building it.
- Issues that require further attention include using timing information for next-crawl prediction, crawling relatively dynamic resources, etc.