Title: HarvestMan for Accessibility Assessment
1. HarvestMan for Accessibility Assessment
Workshop on Web Accessibility and Meta-Modeling
Agder University College, Grimstad, Norway
April 15, 2005
- Anand B Pillai
- abpillai_at_gmail.com
- (SETLABS, Infosys Technologies, Bangalore)
2. Outline
- Introduction
- Applications (Uses)
- Architecture
- Protocols and Tags
- Features
- Threading Architecture
- Thread co-operation
- Flow of information
- Modular design
- Modules
- HarvestMan in EIAO
- EIAO extensions
- Plans for a distributed version
- Distributed operation
- Distributed architecture
- Plans for EIAO
- Framework for developing web accessibility applications
3. HarvestMan - Introduction
- HarvestMan is a web crawler program
- HarvestMan is a console application
- HarvestMan is written entirely in the Python programming language
- HarvestMan is an open source project, released under the GNU General Public License (GPL)
- Version 1.4 is the current public release
- Version 1.4.1 is the development version
- Project page: http://harvestman.freezope.org
- Development is hosted at http://developer.berlios.de
4. HarvestMan - Applications (Uses)
- HarvestMan can be used to:
- Download files from one website or many websites
- Download files from websites matching certain patterns (regular expressions)
- Search a website for keywords and download the web pages containing them
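The keyword-search use can be illustrated with a pure function that decides whether a downloaded page matches the search terms (a trivial sketch, no networking; the function name is invented for the example):

```python
def matches_keywords(page_text, keywords):
    # Case-insensitive check: does the page contain any of the keywords?
    text = page_text.lower()
    return any(k.lower() in text for k in keywords)
```

A crawler would apply such a predicate to each downloaded page and keep only the matches.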
5. HarvestMan - Architecture
- Fully multithreaded
- Uses the producer-consumer design pattern, with co-operating thread classes and multiple queues
- Highly configurable
- Reads options from a text or XML configuration file
- Supports up to 60 different configuration options
- Command-line options are also supported
- The preferred way, however, is to use the configuration files
- Downloads are organized into projects
- Each HarvestMan project has a unique name, a starting url and a download directory
- HarvestMan writes a project file before the start of a download, using the Python pickle protocol
- The project file can be read back later to continue or restart an abandoned/finished project
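The project-file mechanism described above can be sketched roughly as follows (a minimal illustration of the pickle protocol; the field names and functions are assumptions for the example, not HarvestMan's actual format):

```python
import pickle

def save_project(path, name, start_url, download_dir):
    # Write the project state to disk before the download starts.
    state = {"name": name, "url": start_url, "dir": download_dir}
    with open(path, "wb") as f:
        pickle.dump(state, f)

def load_project(path):
    # Read the project file back to continue or restart a project.
    with open(path, "rb") as f:
        return pickle.load(f)
```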
6. HarvestMan - Protocols and Tags
- Protocols
- Supports the HTTP, FTP, File and Gopher protocols
- HTTPS support depends on the Python version used (supported for versions > 2.3)
- HTML tags
- Parses and downloads links pointed to by the following tags:
- Hyperlinks of the form <a href=...>
- Image links of the form <img src=...>
- Links of the form <link href=...>
- Stylesheet links of the form <link rel=stylesheet ...>
- Server-side javascript links of the form <script src=...>
- Server-side java applets (.class files) of the form <applet ...>
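A rough illustration of this kind of tag parsing, using Python's stdlib html.parser (HarvestMan customizes the stdlib HTML parser; this sketch is not its actual code, and the attribute table is an assumption):

```python
from html.parser import HTMLParser

# (tag, attribute) pairs whose values are treated as links to crawl
LINK_ATTRS = {("a", "href"), ("img", "src"), ("link", "href"),
              ("script", "src"), ("applet", "code")}

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Collect the value of any attribute that denotes a link.
        for name, value in attrs:
            if (tag, name) in LINK_ATTRS and value:
                self.links.append(value)

p = LinkExtractor()
p.feed('<a href="page.html"><img src="pic.jpg"></a>'
       '<link rel="stylesheet" href="style.css">')
```

After feeding a page, `p.links` holds the urls the crawler would consider downloading next.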
7. HarvestMan - Features
- Filters
- Filter urls based on regular expression patterns
- Supports patterns based on url file extensions, url path components and server names
- Sample url filter:
- -.jpg -.doc -/exclude-this-path/
- Filter urls based on url scopes. Scopes supported are:
- Url depth scopes (length of a url w.r.t. the root server or the parent url)
- Url boundaries (based on server names/IP)
- Url extents (based on url directories)
- Advertisement (junk url) filter
- Version 1.4.1 has a full-fledged junk filter which can filter out junk (advertisement/banner/flash) urls
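A minimal sketch of regex-based url filtering as described above (the pattern list and function name are invented for the example; HarvestMan's actual filter syntax differs):

```python
import re

# Exclusion patterns: extension-based, and path-component-based
EXCLUDE_PATTERNS = [r"\.jpg$", r"\.doc$", r"/exclude-this-path/"]

def url_allowed(url):
    # Reject a url if any exclusion pattern matches it.
    return not any(re.search(p, url) for p in EXCLUDE_PATTERNS)
```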
8. HarvestMan - Features (Contd.)
- Limits
- A maximum number of web servers can be specified (in a multi-server project)
- A maximum number of directories per web server can be specified
- A maximum number of files can be specified for a given project
- A limit can be set on the maximum size of a file downloaded in a project
- Time limits can be set for a project
- A maximum number of simultaneous live connections (downloads) can be set
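The limits above amount to simple checks applied before each download; a minimal sketch (class and option names are illustrative, not HarvestMan's actual configuration options):

```python
class Limits:
    def __init__(self, max_files=5000, max_file_size=1_000_000):
        self.max_files = max_files          # project-wide file count limit
        self.max_file_size = max_file_size  # per-file size limit, in bytes
        self.files_downloaded = 0

    def may_download(self, size):
        # Refuse once the file-count limit is reached or the file is too big.
        return (self.files_downloaded < self.max_files
                and size <= self.max_file_size)
```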
9. HarvestMan - Features (Contd.)
- Controls
- Obeys the Robot Exclusion Protocol (robots.txt) used by certain servers; can be turned on/off
- Priorities for urls can be specified
- Based on file extensions
- Based on server names/IP
- Priorities can be specified in the range (-5, 5)
- HarvestMan will schedule the download of urls with a higher priority before those with a lower priority
- Sample priority specification:
- jpg=3, png=-2, doc=-5, html=5
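Extension-based priorities can be sketched with a priority queue, so that higher-priority urls are scheduled first (the mapping below mirrors the sample, but the code is an illustration, not HarvestMan's actual scheduler):

```python
import heapq

# Extension -> priority in the (-5, 5) range, as in the sample above
PRIORITY = {"jpg": 3, "png": -2, "doc": -5, "html": 5}

def schedule(urls):
    # heapq is a min-heap, so negate priorities to pop the highest first.
    heap = [(-PRIORITY.get(u.rsplit(".", 1)[-1], 0), u) for u in urls]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]
```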
10. HarvestMan - Features (Contd.)
- Storage (Persistence)
- Files are saved to disk and the website is recreated, preserving the original hyperlink structure of the website(s)
- A cache file is created for every project
- The cache file is a binary file containing the data and metadata of all files downloaded in a project
- The cache file is written using the Python pickle protocol
- The cache file consists of the url, the timestamp (at the web server), the location on disk and the actual content of all files downloaded during a project
- Caching allows the program to download only the urls that have been modified (at the web server) when a project with a cache is re-run
- Caching can be turned on/off
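The cache-driven re-download decision can be sketched as a timestamp comparison (the cache layout and field names here are assumptions for illustration, not HarvestMan's actual cache format):

```python
def needs_download(cache, url, server_timestamp):
    # Re-download only urls that are unknown, or that were modified
    # at the web server after they were cached.
    entry = cache.get(url)
    return entry is None or server_timestamp > entry["timestamp"]
```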
11. HarvestMan - Threading Architecture
- Consists of co-operating fetcher and crawler threads
- Fetcher threads do the job of actually downloading files and saving them to disk
- Crawler threads do the job of parsing web page data and extracting the list of urls to be crawled, according to the HarvestMan rules and limits specified in the configuration file
12. HarvestMan - Thread Co-operation
- Fetcher and crawler threads co-operate by following the producer-consumer paradigm
- HarvestMan uses a symmetric, synergic producer-consumer design pattern
- There are two queues for data flow: a data queue which stores raw web page data (html), and a url queue which stores urls
- Fetcher threads obtain their urls from the url queue. They download the urls and save them to disk. If the url is a web page (html file), its contents are posted to the data queue
- Crawlers get their data from the data queue. They parse the html data, get the new urls and post them to the url queue
- Thus fetchers are the consumers of the url queue and producers for the data queue. Crawlers are consumers of the data queue and producers for the url queue.
- This mutual producer-consumer dependency creates a symmetric and synergic data flow
- Apart from these thread types, there are additional worker or slave threads to which fetcher threads can delegate the actual job of downloading files from urls.
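The thread co-operation described above can be sketched as a runnable toy, with the networking and html parsing stubbed out (one fetcher, one crawler, and an in-memory "website"; this is an illustration of the pattern, not HarvestMan's code):

```python
import queue, threading

url_q, data_q = queue.Queue(), queue.Queue()
PAGES = {"/": ["/a", "/b"], "/a": ["/c"], "/b": [], "/c": []}  # stub site
seen, downloaded = {"/"}, []

def fetcher():
    # Consumer of the url queue, producer for the data queue.
    while True:
        url = url_q.get()
        if url is None:
            url_q.task_done(); break
        downloaded.append(url)      # stands in for "download and save to disk"
        data_q.put(PAGES[url])      # post the web-page "data"
        url_q.task_done()

def crawler():
    # Consumer of the data queue, producer for the url queue.
    while True:
        links = data_q.get()
        if links is None:
            data_q.task_done(); break
        for link in links:          # stands in for html parsing
            if link not in seen:
                seen.add(link)
                url_q.put(link)
        data_q.task_done()

threads = [threading.Thread(target=fetcher), threading.Thread(target=crawler)]
for t in threads:
    t.start()
url_q.put("/")
while True:                         # wait until both queues quiesce
    url_q.join(); data_q.join()
    if url_q.empty() and data_q.empty():
        break
url_q.put(None); data_q.put(None)   # sentinels to stop the threads
for t in threads:
    t.join()
```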
13. HarvestMan - Flow of Information
[Diagram: Fetcher threads get urls from the url queue, download the files, save them to disk, and post web-page data to the data queue; crawler threads get web-page data from the data queue, parse it, and post the extracted urls back to the url queue - the symmetric/synergic producer-consumer threading paradigm.]
14. HarvestMan - Modular Design
- HarvestMan is designed in a modular fashion, with each module doing a specific task
- This facilitates greater re-use of the program's code in other projects, such as EIAO
- HarvestMan can be used as a general framework/library for web crawling; specific application functionality can be plugged in by writing your application code in Python and plugging it in at the right module of HarvestMan
15. HarvestMan - Modules (As of version 1.4.1)
- crawler.py - Code for the fetcher/crawler threads
- urlqueue.py - Module containing the data/url queues
- urlthread.py - Code for the worker/slave threads
- pageparser.py - High-level web-page parser
- urlparser.py - Module to parse urls, get information about them and construct the local filename for the urls; also handles relative urls
- config.py - Holds all configuration options and maintains the state of the program
- connector.py - Network connection configuration, management and url downloads
- rules.py - Applies HarvestMan scoping, limit, control and filter rules to urls, to decide whether to download them or not
- datamgr.py - Manages download requests from different fetcher threads, downloads the urls and maintains state information for downloaded urls, such as cache, url download status, statistical information etc.
- utils.py - A collection of utility functions and classes
- xmlparser.py - Parses the HarvestMan XML config file
- htmlparser.py - HTML parser module, borrowed from the Python library and customized for HarvestMan
- robotparser.py - Manages robots.txt file rules; borrowed from the Python library and customized for HarvestMan
- strptime.py - Pure Python strptime module, used to write timestamps of downloads into the cache files
- common.py - Functions that don't fit anywhere else are put here, in the global namespace
- harvestman.py - Main application module
16. HarvestMan in EIAO
- Crawler component of the EIAO ROBACC
- Crawls the web, obtaining URLs from the URL repository
- Applies scoping rules, according to a scoping scheme in the repository databases, to limit the number of links crawled
- Stores the downloaded files and HTTP headers in the local repository
- Version 1.4.1 is being used for EIAO. It adds a few new features:
- XML configuration option
- Advertisement (junk url) filter
- A few performance enhancements
17. HarvestMan - EIAO Extensions
- Persistency extensions
- Ability to load urls from a database repository (currently read from the config file)
- Ability to save files and metadata such as HTTP headers to a database repository (currently saved to the file system)
- Ability to load a scoping schema from a database repository (currently specified as rules in the config file)
- Most of these changes are in the datamgr.py module
- Url scoping extensions
- Temporal scoping
- Content-aware scoping
- Scoping rules should be dynamically modifiable
- Most of these changes are in the rules.py module
- Scheduling extensions
- A url scheduling extension/modification to the current best-effort priority queue of urls, to support temporal scoping
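The proposed persistency extension could, for instance, store url metadata such as HTTP headers in a database repository; a minimal sqlite3 sketch (the schema and names are assumptions for illustration, not EIAO's actual design):

```python
import sqlite3

con = sqlite3.connect(":memory:")  # a real repository would be a server DB
con.execute("""CREATE TABLE pages (
    url TEXT PRIMARY KEY,
    content_type TEXT,
    last_modified TEXT,
    body BLOB)""")

def store(url, headers, body):
    # Save a downloaded file and selected HTTP headers to the repository.
    con.execute("INSERT OR REPLACE INTO pages VALUES (?, ?, ?, ?)",
                (url, headers.get("Content-Type"),
                 headers.get("Last-Modified"), body))

def fetch(url):
    # Retrieve the stored metadata and content for a url.
    return con.execute(
        "SELECT content_type, body FROM pages WHERE url = ?",
        (url,)).fetchone()
```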
18. HarvestMan - Plans for a Distributed Version
- Use multiple instances of the crawler running on different machines
- Scale out using multiple co-operating crawler and fetcher instances on multiple machines, instead of the current scale-in architecture with multiple threads in the same process
- Use a master-slave style of distributed architecture, with a master crawler running on a central server and slave fetchers running on slave machines
- The crawler instance is a process which performs the job of the existing crawler threads
- Fetcher instances are slave processes which perform the job of the existing fetcher threads
- Communication is via distributed message queues
- Rules are loaded from a central repository which is modifiable over time
19. HarvestMan - Distributed Operation
- The master (crawler instance) downloads the starting url, parses it and gets the new urls
- The new urls are sent to the distributed url queue
- Fetcher instances are started up on the slave machines configured for it
- Fetchers wait at the url queue and get the new urls. They download the urls and save the data/metadata to a central repository
- Web-page data is posted to a distributed data queue by the fetchers
- The crawler instance gets web-page data from the data queue, parses it and gets the new urls
- It then loads the scoping rules from a repository, applies them to the urls and filters out urls that don't satisfy the scoping scheme
- Urls which pass the test are posted to the distributed url queue
- The process continues...
20. HarvestMan - Distributed Version Architecture
- Currently no code, only a plan!
- A basic proof-of-concept implementation can be done using Python Remote Objects (Pyro) as the distributed computing middleware
- Pyro provides a very simple RPC framework for distributed Python programs
- It supports a master/slave architecture
- It allows fast and easy porting of non-distributed code to a basic distributed prototype
- Written in pure Python, with no external dependencies
- Could also take a look at using tuple spaces; PyLinda provides a framework for this
21. HarvestMan - Plans for EIAO
- EIAO version 1.0 (proof of concept)
- Write the persistency, scoping and scheduling extensions
- Add more plugins as needed
- EIAO version 2.0
- Use the distributed HarvestMan architecture to distribute crawling tasks more efficiently across multiple machines in a cluster
- Allow distributed fetching of scoping rules from repositories
- Should fit in with the EIAO performance requirements at this time
22. HarvestMan - Framework for Creating Web Accessibility Applications
- Pluggable design, hence very customizable
- Can be used as a framework for developing applications with very specific processing capabilities on top of the basic web crawling provided by HarvestMan
- Suitable for teaching university courses on web mining or web accessibility application development
23. Questions?
24. THANK YOU!
- Anand B Pillai
- abpillai_at_gmail.com
25. References
- HarvestMan web crawler: http://harvestman.freezope.org
- Pyro (Python Remote Objects): http://pyro.sourceforge.net