Title: HarvestMan for Accessibility Assessment
1. HarvestMan for Accessibility Assessment
Workshop on Web Accessibility and Meta-Modeling
Agder University College, Grimstad, Norway
April 15, 2005
- Anand B Pillai
- abpillai_at_gmail.com
- (SETLABS, Infosys Technologies, Bangalore)
2. Outline
- Introduction
- Applications (Uses)
- Architecture
- Protocols and Tags
- Features
- Threading Architecture
- Thread co-operation
- Flow of information
- Modular design
- Modules
- HarvestMan in EIAO
- EIAO extensions
- Plans for a distributed version
- Distributed operation
- Distributed architecture
- Plans for EIAO
- Framework for developing web accessibility applications
3. HarvestMan - Introduction
- HarvestMan is a web crawler program
- HarvestMan is a console application
- HarvestMan is written entirely in the Python programming language
- HarvestMan is an open source project, released under the GNU General Public License (GPL)
- Version 1.4 is the current public release
- Version 1.4.1 is the development version
- Project page: http://harvestman.freezope.org
- Development is hosted at http://developer.berlios.de
4. HarvestMan - Applications (Uses)
- HarvestMan can be used to:
- Download files from one website or many websites
- Download files from websites matching certain patterns (regular expressions)
- Search a website for keywords and download the web pages containing them
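The keyword-search use can be illustrated with a pure function that decides whether a downloaded page matches the search terms (a trivial sketch, no networking; the function name is invented for the example):

```python
def matches_keywords(page_text, keywords):
    # Case-insensitive check: does the page contain any of the keywords?
    text = page_text.lower()
    return any(k.lower() in text for k in keywords)
```

A crawler would apply such a predicate to each downloaded page and keep only the matches.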
5. HarvestMan - Architecture
- Fully multithreaded
- Uses the producer-consumer design pattern, with co-operating thread classes and multiple queues
- Highly configurable
- Reads options from a text or XML configuration file
- Supports up to 60 different configuration options
- Command-line options are also supported
- The preferred way, however, is to use the configuration files
- Downloads are organized into projects
- Each HarvestMan project has a unique name, a starting url and a download directory
- HarvestMan writes a project file before the start of a download, using the Python pickle protocol
- The project file can be read back later to continue or restart an abandoned/finished project
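The project-file mechanism described above can be sketched roughly as follows (a minimal illustration of the pickle protocol; the field names and functions are assumptions for the example, not HarvestMan's actual format):

```python
import pickle

def save_project(path, name, start_url, download_dir):
    # Write the project state to disk before the download starts.
    state = {"name": name, "url": start_url, "dir": download_dir}
    with open(path, "wb") as f:
        pickle.dump(state, f)

def load_project(path):
    # Read the project file back to continue or restart a project.
    with open(path, "rb") as f:
        return pickle.load(f)
```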
6. HarvestMan - Protocols and Tags
- Protocols
- Supports the HTTP, FTP, File and Gopher protocols
- HTTPS support depends on the Python version used (supported for versions > 2.3)
- HTML tags
- Parses and downloads links pointed to by the following tags:
- Hyperlinks of the form <a href=...>
- Image links of the form <img src=...>
- Links of the form <link href=...>
- Stylesheet links of the form <link rel=stylesheet ...>
- Server-side javascript links of the form <script src=...>
- Server-side java applets (.class files) of the form <applet ...>
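A rough illustration of this kind of tag parsing, using Python's stdlib html.parser (HarvestMan customizes the stdlib HTML parser; this sketch is not its actual code, and the attribute table is an assumption):

```python
from html.parser import HTMLParser

# (tag, attribute) pairs whose values are treated as links to crawl
LINK_ATTRS = {("a", "href"), ("img", "src"), ("link", "href"),
              ("script", "src"), ("applet", "code")}

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Collect the value of any attribute that denotes a link.
        for name, value in attrs:
            if (tag, name) in LINK_ATTRS and value:
                self.links.append(value)

p = LinkExtractor()
p.feed('<a href="page.html"><img src="pic.jpg"></a>'
       '<link rel="stylesheet" href="style.css">')
```

After feeding a page, `p.links` holds the urls the crawler would consider downloading next.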
7. HarvestMan - Features
- Filters
- Filter urls based on regular expression patterns
- Supports patterns based on url file extensions, url path components and server names
- Sample url filter:
- -.jpg -.doc -/exclude-this-path/
- Filter urls based on url scopes. Scopes supported are:
- Url depth scopes (length of a url w.r.t. the root server or the parent url)
- Url boundaries (based on server names/IP)
- Url extents (based on url directories)
- Advertisement (junk url) filter
- Version 1.4.1 has a full-fledged junk filter which can filter out junk (advertisement/banner/flash) urls
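A minimal sketch of regex-based url filtering as described above (the pattern list and function name are invented for the example; HarvestMan's actual filter syntax differs):

```python
import re

# Exclusion patterns: extension-based, and path-component-based
EXCLUDE_PATTERNS = [r"\.jpg$", r"\.doc$", r"/exclude-this-path/"]

def url_allowed(url):
    # Reject a url if any exclusion pattern matches it.
    return not any(re.search(p, url) for p in EXCLUDE_PATTERNS)
```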
8. HarvestMan - Features (Contd.)
- Limits
- A maximum number of web servers can be specified (in a multi-server project)
- A maximum number of directories per web server can be specified
- A maximum number of files can be specified for a given project
- A limit can be set on the maximum size of a file downloaded in a project
- Time limits can be set for a project
- A maximum number of simultaneous live connections (downloads) can be set
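The limits above amount to simple checks applied before each download; a minimal sketch (class and option names are illustrative, not HarvestMan's actual configuration options):

```python
class Limits:
    def __init__(self, max_files=5000, max_file_size=1_000_000):
        self.max_files = max_files          # project-wide file count limit
        self.max_file_size = max_file_size  # per-file size limit, in bytes
        self.files_downloaded = 0

    def may_download(self, size):
        # Refuse once the file-count limit is reached or the file is too big.
        return (self.files_downloaded < self.max_files
                and size <= self.max_file_size)
```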
9. HarvestMan - Features (Contd.)
- Controls
- Obeys the Robot Exclusion Protocol (robots.txt) used by certain servers; can be turned on/off
- Priorities for urls can be specified
- Based on file extensions
- Based on server names/IP
- Priorities can be specified in the range (-5, 5)
- HarvestMan will schedule the download of urls with a higher priority before those with a lower priority
- Sample priority specification:
- jpg=3, png=-2, doc=-5, html=5
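Extension-based priorities can be sketched with a priority queue, so that higher-priority urls are scheduled first (the mapping below mirrors the sample, but the code is an illustration, not HarvestMan's actual scheduler):

```python
import heapq

# Extension -> priority in the (-5, 5) range, as in the sample above
PRIORITY = {"jpg": 3, "png": -2, "doc": -5, "html": 5}

def schedule(urls):
    # heapq is a min-heap, so negate priorities to pop the highest first.
    heap = [(-PRIORITY.get(u.rsplit(".", 1)[-1], 0), u) for u in urls]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]
```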
10. HarvestMan - Features (Contd.)
- Storage (Persistence)
- Files are saved to disk and the website is recreated, preserving the original hyperlink structure of the website(s)
- A cache file is created for every project
- The cache file is a binary file containing the data and metadata of all files downloaded in a project
- The cache file is written using the Python pickle protocol
- The cache file consists of the url, the timestamp (at the web server), the location on disk and the actual content of all files downloaded during a project
- Caching allows the program to download only the urls that have been modified (at the web server) when a project with a cache is re-run
- Caching can be turned on/off
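The cache-driven re-download decision can be sketched as a timestamp comparison (the cache layout and field names here are assumptions for illustration, not HarvestMan's actual cache format):

```python
def needs_download(cache, url, server_timestamp):
    # Re-download only urls that are unknown, or that were modified
    # at the web server after they were cached.
    entry = cache.get(url)
    return entry is None or server_timestamp > entry["timestamp"]
```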
11. HarvestMan - Threading Architecture
- Consists of co-operating fetcher and crawler threads
- Fetcher threads do the job of actually downloading files and saving them to disk
- Crawler threads do the job of parsing web page data and extracting the list of urls to be crawled, according to the HarvestMan rules and limits specified in the configuration file
12. HarvestMan - Thread Co-operation
- Fetcher and crawler threads co-operate by following the producer-consumer paradigm
- HarvestMan uses a symmetric, synergic producer-consumer design pattern
- There are two queues for data flow: a data queue which stores raw web page data (html), and a url queue which stores urls
- Fetcher threads obtain their urls from the url queue. They download the urls and save them to disk. If the url is a web page (html file), its contents are posted to the data queue
- Crawlers get their data from the data queue. They parse the html data, get the new urls and post them to the url queue
- Thus fetchers are the consumers of the url queue and producers for the data queue. Crawlers are consumers of the data queue and producers for the url queue.
- This mutual producer-consumer dependency creates a symmetric and synergic data flow
- Apart from these thread types, there are additional worker or slave threads to which fetcher threads can delegate the actual job of downloading files from urls.
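The thread co-operation described above can be sketched as a runnable toy, with the networking and html parsing stubbed out (one fetcher, one crawler, and an in-memory "website"; this is an illustration of the pattern, not HarvestMan's code):

```python
import queue, threading

url_q, data_q = queue.Queue(), queue.Queue()
PAGES = {"/": ["/a", "/b"], "/a": ["/c"], "/b": [], "/c": []}  # stub site
seen, downloaded = {"/"}, []

def fetcher():
    # Consumer of the url queue, producer for the data queue.
    while True:
        url = url_q.get()
        if url is None:
            url_q.task_done(); break
        downloaded.append(url)      # stands in for "download and save to disk"
        data_q.put(PAGES[url])      # post the web-page "data"
        url_q.task_done()

def crawler():
    # Consumer of the data queue, producer for the url queue.
    while True:
        links = data_q.get()
        if links is None:
            data_q.task_done(); break
        for link in links:          # stands in for html parsing
            if link not in seen:
                seen.add(link)
                url_q.put(link)
        data_q.task_done()

threads = [threading.Thread(target=fetcher), threading.Thread(target=crawler)]
for t in threads:
    t.start()
url_q.put("/")
while True:                         # wait until both queues quiesce
    url_q.join(); data_q.join()
    if url_q.empty() and data_q.empty():
        break
url_q.put(None); data_q.put(None)   # sentinels to stop the threads
for t in threads:
    t.join()
```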
13. HarvestMan - Flow of Information
[Diagram: Fetcher threads get urls from the url queue, download the files, save them to disk, and post web-page data to the data queue; crawler threads get web-page data from the data queue, parse it, and post the extracted urls back to the url queue - the symmetric/synergic producer-consumer threading paradigm.]
14. HarvestMan - Modular Design
- HarvestMan is designed in a modular fashion, with each module doing a specific task
- This facilitates greater re-use of the program's code in other projects, such as EIAO
- HarvestMan can be used as a general framework/library for web crawling; specific application functionality can be plugged in by writing your application code in Python and plugging it in at the right module of HarvestMan
15. HarvestMan - Modules (As of version 1.4.1)
- crawler.py - Code for the fetcher/crawler threads
- urlqueue.py - Module containing the data/url queues
- urlthread.py - Code for the worker/slave threads
- pageparser.py - High-level web-page parser
- urlparser.py - Module to parse urls, get information about them and construct the local filename for the urls; also handles relative urls
- config.py - Holds all configuration options and maintains the state of the program
- connector.py - Network connection configuration, management and url downloads
- rules.py - Applies HarvestMan scoping, limit, control and filter rules to urls, to decide whether to download them or not
- datamgr.py - Manages download requests from different fetcher threads, downloads the urls and maintains state information for downloaded urls, such as cache, url download status, statistical information etc.
- utils.py - A collection of utility functions and classes
- xmlparser.py - Parses the HarvestMan XML config file
- htmlparser.py - HTML parser module, borrowed from the Python library and customized for HarvestMan
- robotparser.py - Manages robots.txt file rules; borrowed from the Python library and customized for HarvestMan
- strptime.py - Pure Python strptime module, used to write timestamps of downloads into the cache files
- common.py - Functions that don't fit anywhere else are put here, in the global namespace
- harvestman.py - Main application module
16. HarvestMan in EIAO
- Crawler component of the EIAO ROBACC
- Crawls the web, obtaining URLs from the URL repository
- Applies scoping rules, according to a scoping scheme in the repository databases, to limit the number of links crawled
- Stores the downloaded files and HTTP headers in the local repository
- Version 1.4.1 is being used for EIAO. It adds a few new features:
- XML configuration option
- Advertisement (junk url) filter
- A few performance enhancements
17. HarvestMan - EIAO Extensions
- Persistency extensions
- Ability to load urls from a database repository (currently read from the config file)
- Ability to save files and metadata such as HTTP headers to a database repository (currently saved to the file system)
- Ability to load a scoping schema from a database repository (currently specified as rules in the config file)
- Most of these changes are in the datamgr.py module
- Url scoping extensions
- Temporal scoping
- Content-aware scoping
- Scoping rules should be dynamically modifiable
- Most of these changes are in the rules.py module
- Scheduling extensions
- A url scheduling extension/modification to the current best-effort priority queue of urls, to support temporal scoping
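The proposed persistency extension could, for instance, store url metadata such as HTTP headers in a database repository; a minimal sqlite3 sketch (the schema and names are assumptions for illustration, not EIAO's actual design):

```python
import sqlite3

con = sqlite3.connect(":memory:")  # a real repository would be a server DB
con.execute("""CREATE TABLE pages (
    url TEXT PRIMARY KEY,
    content_type TEXT,
    last_modified TEXT,
    body BLOB)""")

def store(url, headers, body):
    # Save a downloaded file and selected HTTP headers to the repository.
    con.execute("INSERT OR REPLACE INTO pages VALUES (?, ?, ?, ?)",
                (url, headers.get("Content-Type"),
                 headers.get("Last-Modified"), body))

def fetch(url):
    # Retrieve the stored metadata and content for a url.
    return con.execute(
        "SELECT content_type, body FROM pages WHERE url = ?",
        (url,)).fetchone()
```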
18. HarvestMan - Plans for a Distributed Version
- Use multiple instances of the crawler running on different machines
- Scale out using multiple co-operating crawler and fetcher instances on multiple machines, instead of the current scale-in architecture with multiple threads in the same process
- Use a master-slave style of distributed architecture, with a master crawler running on a central server and slave fetchers running on slave machines
- The crawler instance is a process which performs the job of the existing crawler threads
- Fetcher instances are slave processes which perform the job of the existing fetcher threads
- Communication is via distributed message queues
- Rules are loaded from a central repository which is modifiable over time
19. HarvestMan - Distributed Operation
- The master (crawler instance) downloads the starting url, parses it and gets the new urls
- The new urls are sent to the distributed url queue
- Fetcher instances are started up on the slave machines configured for it
- Fetchers wait at the url queue and get the new urls. They download the urls and save the data/metadata to a central repository
- Web-page data is posted to a distributed data queue by the fetchers
- The crawler instance gets web-page data from the data queue, parses it and gets the new urls
- It then loads the scoping rules from a repository, applies them to the urls and filters out urls that don't satisfy the scoping scheme
- Urls which pass the test are posted to the distributed url queue
- The process continues...
20. HarvestMan - Distributed Version Architecture
- Currently no code, only a plan!
- A basic proof-of-concept implementation can be done using Python Remote Objects (Pyro) as the distributed computing middleware
- Pyro provides a very simple RPC framework for distributed Python programs
- It supports a master/slave architecture
- It allows fast and easy porting of non-distributed code to a basic distributed prototype
- Written in pure Python, with no external dependencies
- Could also take a look at using tuple spaces; PyLinda provides a framework for this
21. HarvestMan - Plans for EIAO
- EIAO version 1.0 (proof of concept)
- Write the persistency, scoping and scheduling extensions
- Add more plugins as needed
- EIAO version 2.0
- Use the distributed HarvestMan architecture to distribute crawling tasks more efficiently across multiple machines in a cluster
- Allow distributed fetching of scoping rules from repositories
- Should fit in with the EIAO performance requirements at this time
22. HarvestMan - Framework for Creating Web Accessibility Applications
- Pluggable design, hence very customizable
- Can be used as a framework for developing applications with very specific processing capabilities on top of the basic web crawling provided by HarvestMan
- Suitable for teaching university courses on web mining or web accessibility application development
23. Questions?
24. THANK YOU!
- Anand B Pillai
- abpillai_at_gmail.com
25. References
- HarvestMan web crawler: http://harvestman.freezope.org
- Pyro (Python Remote Objects): http://pyro.sourceforge.net