1
HarvestMan for Accessibility Assessment
Workshop on web accessibility and meta-modeling
Agder University College, Grimstad, Norway,
April 15, 2005
  • Anand B Pillai
  • abpillai@gmail.com
  • (SETLABS, Infosys Technologies, Bangalore)

2
Outline
  • Introduction
  • Applications (Uses)
  • Architecture
  • Protocols & Tags
  • Features
  • Threading Architecture
  • Thread co-operation
  • Flow of information
  • Modular design
  • Modules
  • HarvestMan in EIAO
  • EIAO extensions
  • Plans for a distributed version
  • Distributed operation
  • Distributed architecture
  • Plans for EIAO
  • Framework for developing web accessibility
    applications

3
HarvestMan Introduction
  • HarvestMan is a web crawler program
  • HarvestMan is a console application
  • HarvestMan is written completely in the Python
    programming language
  • HarvestMan is an open source project, released
    under the GNU General Public License (GPL)
  • Version 1.4 is the current version available to
    the public
  • Version 1.4.1 is the development version
  • Project page is http://harvestman.freezope.org
  • Development is hosted at
    http://developer.berlios.de

4
HarvestMan Applications (Uses)
  • HarvestMan can be used to:
  • Download files from a website or many websites
  • Download files from websites matching certain
    patterns (regular expressions)
  • Search a web site for keywords and download web
    pages containing them

5
HarvestMan - Architecture
  • Fully Multithreaded
  • Uses the Producer-Consumer design pattern with
    co-operating thread classes and multiple queues
  • Highly Configurable
  • Reads options from a text or xml configuration
    file.
  • Supports up to 60 different kinds of
    configuration options
  • Command-line options also supported
  • Preferred way is to use the configuration files,
    however.
  • Downloads organized into projects
  • Each HarvestMan project has a unique name. It
    also has a starting url and a download directory
  • HarvestMan writes a project file before the start
    of a download using the Python pickle protocol
  • The project file can be read back later to
    continue or re-start an abandoned/finished project
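The project-file idea above can be sketched with the standard pickle module. The field names and file name below are illustrative assumptions, not HarvestMan's actual project-file layout:

```python
import os
import pickle
import tempfile

# Hypothetical project record: the fields mirror the slide (unique name,
# starting url, download directory) but the layout is an assumption.
project = {
    "name": "example-project",               # unique project name
    "starting_url": "http://example.com/",   # starting url
    "download_dir": "/tmp/example-project",  # download directory
}

path = os.path.join(tempfile.gettempdir(), "example-project.pickle")

# Write the project file before the download starts...
with open(path, "wb") as f:
    pickle.dump(project, f)

# ...and read it back later to continue or re-start the project.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored["starting_url"])  # http://example.com/
```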

6
HarvestMan Protocols & Tags
  • Protocols
  • Supports HTTP, FTP, File and Gopher protocols
  • HTTPS support depends on the Python version used
    (supported for versions > 2.3)
  • HTML Tags
  • Parses and downloads links pointed by the
    following tags
  • Hyperlinks of the form <a href=...>
  • Image links of the form <img href=...>
  • Image links of the form <img src=...>
  • Links of the form <link href=...>
  • Stylesheet links of the form
    <link rel=stylesheet ...>
  • Server-side javascript links of the form
    <script src=...>
  • Server-side java applets (.class files) of the
    form <applet ...>
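The tag handling listed above can be sketched with the standard-library HTML parser (HarvestMan customized Python's own htmlparser module; this `LinkExtractor` class is an illustrative stand-in, not HarvestMan's code):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects link targets from the kinds of tags the slide lists."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("a", "link") and "href" in attrs:
            self.links.append(attrs["href"])   # <a href=...>, <link href=...>
        elif tag in ("img", "script") and "src" in attrs:
            self.links.append(attrs["src"])    # <img src=...>, <script src=...>
        elif tag == "applet" and "code" in attrs:
            self.links.append(attrs["code"])   # <applet code=...> (.class file)

extractor = LinkExtractor()
extractor.feed('<a href="page.html"><img src="pic.jpg">'
               '<link rel="stylesheet" href="style.css">'
               '<script src="app.js"></script>')
print(extractor.links)  # ['page.html', 'pic.jpg', 'style.css', 'app.js']
```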

7
HarvestMan - Features
  • Filters
  • Filter urls based on regular expression patterns
  • Supports patterns based on url file extensions,
    url path components and server names
  • Sample url filter
  • -.jpg -.doc -/exclude-this-path/
  • Filter urls based on url scopes. Scopes
    supported are,
  • Url depth scopes (depth of a url w.r.t. the root
    server or the parent url)
  • Url boundaries (Based on server names/i.p)
  • Url extents (Based on url directories)
  • Advertisement (Junk url) filter
  • Version 1.4.1 has a full-fledged junk filter
    which can filter out junk
    (advertisement/banner/flash) urls
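The filter idea above can be re-created with plain regular expressions. HarvestMan has its own pattern syntax (the `-` prefix in the sample marks an exclusion); the patterns and function below are an illustrative approximation, not the real implementation:

```python
import re

# Exclusion patterns corresponding to the sample filter on the slide:
# skip .jpg files, .doc files, and anything under /exclude-this-path/.
EXCLUDE_PATTERNS = [
    re.compile(r"\.jpg$"),               # url file extension
    re.compile(r"\.doc$"),               # url file extension
    re.compile(r"/exclude-this-path/"),  # url path component
]

def allowed(url):
    """Return False if any exclusion pattern matches the url."""
    return not any(p.search(url) for p in EXCLUDE_PATTERNS)

print(allowed("http://example.com/index.html"))           # True
print(allowed("http://example.com/images/photo.jpg"))     # False
print(allowed("http://example.com/exclude-this-path/a"))  # False
```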

8
HarvestMan Features (Contd.)
  • Limits
  • Maximum number of web servers can be specified
    (in a multiserver project)
  • Maximum number of directories in a webserver can
    be specified
  • Maximum number of files can be specified in a
    given project
  • Limit can be set to the maximum size of a file
    downloaded in a project
  • Time-limits can be set for a project
  • Maximum number of simultaneous alive connections
    (downloads) can be set
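A per-project limit check along the lines of this slide might look as follows. The class name, defaults and method are assumptions for illustration, not HarvestMan's configuration keys:

```python
import time

class CrawlLimits:
    """Toy sketch of per-project download limits."""
    def __init__(self, max_files=500, max_file_size=1_000_000, max_time=3600):
        self.max_files = max_files          # max number of files per project
        self.max_file_size = max_file_size  # max size of one download (bytes)
        self.max_time = max_time            # project time limit (seconds)
        self.files = 0
        self.started = time.time()

    def may_download(self, size):
        if self.files >= self.max_files:                # file-count limit
            return False
        if size > self.max_file_size:                   # file-size limit
            return False
        if time.time() - self.started > self.max_time:  # time limit
            return False
        self.files += 1
        return True

limits = CrawlLimits(max_files=2)
print(limits.may_download(100))        # True
print(limits.may_download(2_000_000))  # False: file too large
```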

9
HarvestMan Features (Contd.)
  • Controls
  • Obeys the Robot Exclusion Protocol (robots.txt)
    used by certain servers. Can be turned on/off
  • Priorities for urls can be specified
  • Based on file extensions
  • Based on server names/i.p
  • Priorities can be specified in a range of (-5,5)
  • HarvestMan will schedule download of urls with a
    higher priority before those with a lower
    priority
  • Sample priority specification setting
  • jpg+3, png-2, doc-5, html+5
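The priority scheduling described above can be sketched with a heap. Higher-priority urls come out first, so the key negates the priority (`heapq` is a min-heap); the table mirrors the sample setting on the slide, but the code itself is illustrative, not HarvestMan's scheduler:

```python
import heapq

# Priorities in the range (-5, 5), keyed by file extension, as in the sample.
PRIORITIES = {"jpg": 3, "png": -2, "doc": -5, "html": 5}

heap = []
counter = 0  # tie-breaker: keeps insertion order for equal priorities

def schedule(url):
    global counter
    ext = url.rsplit(".", 1)[-1]
    priority = PRIORITIES.get(ext, 0)
    heapq.heappush(heap, (-priority, counter, url))  # negate: max-priority first
    counter += 1

for u in ["a.doc", "b.html", "c.png", "d.jpg"]:
    schedule(u)

order = [heapq.heappop(heap)[2] for _ in range(len(heap))]
print(order)  # ['b.html', 'd.jpg', 'c.png', 'a.doc']
```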

10
HarvestMan Features (Contd.)
  • Storage (Persistence)
  • Files are saved to the disk and the website
    recreated, preserving the original hyperlink
    structure of the website(s)
  • A cache file is created for every project
  • The cache file is a binary file containing the
    data and metadata of all files downloaded in a
    project
  • The cache file is written using the Python pickle
    protocol
  • The cache file consists of the url, timestamp (at
    the web server), location on the disk and the
    actual content of all files downloaded during a
    project.
  • Caching allows the program to download only the
    urls that have been modified (at the web server)
    when a project with a cache is re-run.
  • Caching can be turned on/off.
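The freshness check the cache enables can be sketched as below. The dictionary layout and function are assumptions, not the real cache format; only the pickle serialization and the url/timestamp/location/content fields come from the slide:

```python
import pickle

# One cache entry per downloaded url, as described on the slide.
cache = {
    "http://example.com/index.html": {
        "timestamp": "Mon, 11 Apr 2005 10:00:00 GMT",  # at the web server
        "location": "/tmp/project/index.html",         # on the disk
        "content": b"<html>...</html>",                # actual content
    }
}

# The cache file itself is serialized with the pickle protocol.
blob = pickle.dumps(cache)

def needs_download(url, server_timestamp):
    """Re-download only if the url is new or modified at the server."""
    entry = cache.get(url)
    return entry is None or entry["timestamp"] != server_timestamp

print(needs_download("http://example.com/index.html",
                     "Mon, 11 Apr 2005 10:00:00 GMT"))  # False: unchanged
print(needs_download("http://example.com/new.html",
                     "Mon, 11 Apr 2005 10:00:00 GMT"))  # True: not cached
```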

11
HarvestMan Threading Architecture
  • Consists of co-operating fetcher and crawler
    threads
  • Fetcher threads do the job of actually
    downloading files and saving them to the disk
  • Crawler threads do the job of parsing web page
    data and extracting the list of urls to be
    crawled according to the HarvestMan rules &
    limits specified in the configuration file

12
HarvestMan Thread Co-operation
  • Fetcher and crawler threads co-operate by
    following the producer-consumer paradigm
  • HarvestMan uses a symmetric, synergic
    producer-consumer design pattern
  • There are two queues for data flow: a data queue
    which stores raw web page data (html), and a url
    queue which stores urls
  • Fetcher threads obtain their urls from the url
    queue. They download the urls and save them to
    the disk. If the url is a web page (html file),
    its contents are posted to the data queue
  • Crawlers get their data from the data queue. They
    parse the html data, get the new urls and post
    them to the url queue
  • Thus fetchers are the consumers of the url queue
    and producers for the data queue. Crawlers are
    consumers of the data queue and producers for the
    url queue.
  • This mutual producer-consumer dependency creates
    a symmetric and synergic data flow
  • Apart from these thread types, there are
    additional worker or slave threads to which
    fetcher threads can delegate the actual job of
    downloading files from urls.
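The symmetric producer-consumer flow above can be demonstrated in miniature with two in-memory queues: a fetcher thread consuming urls and producing page data, and a crawler thread consuming page data and producing urls. The `pages` dict stands in for the network, and the `Tracker` class is an illustrative way to detect when the crawl drains, not HarvestMan's actual code:

```python
import queue
import threading

# A tiny "web": each url maps to its html content.
pages = {
    "http://example.com/": '<a href="http://example.com/a.html">'
                           '<a href="http://example.com/b.html">',
    "http://example.com/a.html": '<a href="http://example.com/b.html">',
    "http://example.com/b.html": "no links here",
}

url_queue = queue.Queue()    # crawlers produce urls, fetchers consume them
data_queue = queue.Queue()   # fetchers produce page data, crawlers consume it
seen = set()
downloaded = []
lock = threading.Lock()

class Tracker:
    """Counts items in flight so the main thread knows when the crawl ends."""
    def __init__(self):
        self.count = 0
        self.cond = threading.Condition()
    def add(self):
        with self.cond:
            self.count += 1
    def done(self):
        with self.cond:
            self.count -= 1
            if self.count == 0:
                self.cond.notify_all()
    def wait(self):
        with self.cond:
            while self.count:
                self.cond.wait()

tracker = Tracker()

def fetcher():
    # Consumer of the url queue, producer for the data queue.
    while True:
        url = url_queue.get()
        if url is None:                  # sentinel: shut down
            break
        with lock:
            downloaded.append(url)       # "download and save to disk"
        tracker.add()                    # a data item enters the system
        data_queue.put(pages[url])       # post web-page data
        tracker.done()                   # the url item is fully processed

def crawler():
    # Consumer of the data queue, producer for the url queue.
    while True:
        data = data_queue.get()
        if data is None:
            break
        for chunk in data.split('href="')[1:]:   # naive link "parsing"
            url = chunk.split('"')[0]
            with lock:
                new = url not in seen
                seen.add(url)
            if new:
                tracker.add()
                url_queue.put(url)       # post newly discovered urls
        tracker.done()

threads = [threading.Thread(target=fetcher), threading.Thread(target=crawler)]
for t in threads:
    t.start()

seed = "http://example.com/"
seen.add(seed)
tracker.add()
url_queue.put(seed)

tracker.wait()            # returns once both queues have drained
url_queue.put(None)       # sentinels stop both threads
data_queue.put(None)
for t in threads:
    t.join()

print(sorted(downloaded))
```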

13
HarvestMan Flow of information

[Diagram: Fetcher threads get urls from the Url
Queue, download files & save them to disk, and post
web-page data to the Data Queue; Crawler threads get
web-page data from the Data Queue, parse it, and
post new urls to the Url Queue. A symmetric/synergic
producer-consumer threading paradigm.]
14
HarvestMan Modular Design
  • HarvestMan is designed in a modular fashion, each
    module doing a specific task.
  • This facilitates greater re-use of the program's
    code in other projects, such as EIAO.
  • HarvestMan can be used as a general
    framework/library for web crawling; specific
    application functionality can be plugged in by
    writing application code in Python and hooking
    it into the right HarvestMan module
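The plug-in idea can be sketched with a simple hook. The `CrawlerFramework` class, the hook name and the toy accessibility check below are all hypothetical illustrations of the pattern, not HarvestMan's real API:

```python
class CrawlerFramework:
    """Minimal sketch of a crawler that exposes a per-page plug-in hook."""
    def __init__(self, page_processor=None):
        self.page_processor = page_processor  # application-specific plug-in

    def handle_page(self, url, html):
        # Generic crawling work (saving, link extraction) would happen here;
        # then control passes to the plugged-in application code.
        if self.page_processor:
            return self.page_processor(url, html)

def count_images_without_alt(url, html):
    """Toy accessibility check plugged into the framework."""
    imgs = html.count("<img")
    with_alt = html.count("alt=")
    return imgs - with_alt

crawler = CrawlerFramework(page_processor=count_images_without_alt)
result = crawler.handle_page("http://example.com/",
                             '<img src="a.png"><img src="b.png" alt="logo">')
print(result)  # 1: one image is missing an alt attribute
```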

15
HarvestMan Modules (As of version 1.4.1)
  • Crawler.py: Code for fetcher/crawler threads
  • Urlqueue.py: Module containing the data/url
    queues
  • Urlthread.py: Code for worker/slave threads
  • Pageparser.py: High-level web-page parser
  • Urlparser.py: Module to parse urls, get
    information about them and construct the local
    filename for the urls; also handles relative
    urls
  • Config.py: Holds all configuration options and
    maintains state of the program
  • Connector.py: Network connection configuration,
    management and url downloads
  • Rules.py: Applies HarvestMan scoping, limit,
    control and filter rules to urls to decide
    whether to download them or not
  • Datamgr.py: Manages download requests from
    different fetcher threads, downloads the urls and
    maintains state information for downloaded urls
    such as cache, url download status, statistical
    information etc.
  • Utils.py: A collection of utility functions and
    classes
  • Xmlparser.py: Parses the HarvestMan xml config
    file
  • Htmlparser.py: Html parser module, borrowed from
    the Python library and customized for HarvestMan
  • Robotparser.py: Manages robots.txt file rules;
    borrowed from the Python library and customized
    for HarvestMan
  • Strptime.py: Pure Python strptime module used to
    write timestamps of downloads into the cache
    files
  • Common.py: Functions that don't fit anywhere
    else, placed in the global namespace.
  • Harvestman.py: Main application module

16
HarvestMan in EIAO
  • Crawler component of the EIAO ROBACC
  • Crawls the web, obtaining URLs from the URL
    repository
  • Applies scoping rules according to a scoping
    scheme in the repository databases to limit the
    number of links crawled
  • Stores downloaded files and HTTP headers in the
    local repository
  • Version 1.4.1 is being used for EIAO. It adds a
    few new features
  • XML configuration option
  • Advertisement (Junk url) filter
  • A few performance enhancements

17
HarvestMan EIAO extensions
  • Persistency extensions
  • Ability to load urls from a database repository
    (Currently reads it from the config file)
  • Ability to save files and metadata such as HTTP
    headers to a database repository (Currently saves
    it to the file system)
  • Ability to load a scoping schema from a database
    repository (Currently specified as rules in the
    config file)
  • Most of the changes are in the datamgr.py module
  • Url scoping extensions
  • Temporal scoping
  • Content-aware scoping
  • Scoping rules should be dynamically modifiable
  • Most of the changes are in the rules.py module
  • Scheduling extensions
  • A url scheduling extension/modification to the
    current best-effort priority queue of urls. This
    is to support temporal scoping

18
HarvestMan Plans for a Distributed version
  • Use multiple instances of the crawler running in
    different machines
  • Scale out using multiple co-operating crawler and
    fetcher instances on multiple machines instead of
    the current scale-in architecture with multiple
    threads in the same process
  • Use a master-slave kind of distributed
    architecture with a master crawler running on a
    central server and slave fetchers running on
    slave machines
  • The crawler instance is a process which performs
    the job of the existing crawler threads
  • Fetcher instances are slave processes which
    perform the job of the existing fetcher threads
  • Communication is via distributed message queues
  • Rules are loaded from a central repository which
    is modifiable over time.

19
HarvestMan Distributed Operation
  • Master (crawler instance) downloads the starting
    url, parses it and gets the new urls
  • The new urls are sent to the distributed url
    queue
  • Fetcher instances are started up on slave
    machines configured for it
  • Fetchers wait at the url queue and get the new
    urls. They download the urls, save data/metadata
    to a central repository
  • Web-page data is posted to a distributed data
    queue by the fetchers
  • The crawler instance gets web page data from the
    data queue, parses it, and gets the new urls
  • It then loads the scoping rules from a
    repository, applies them to the urls and filters
    out urls that don't satisfy the scoping scheme
  • Urls which pass the test are posted to the
    distributed url queue
  • The process continues...

20
HarvestMan Distributed version architecture
  • Currently no code, only a plan!
  • A basic proof of concept implementation can be
    done using Python Remote Objects (Pyro) as the
    distributed computing middleware
  • Pyro provides a very simple RPC framework for
    distributed Python programs
  • It supports a master/slave architecture
  • Allows fast and easy porting of non-distributed
    code to a basic distributed prototype
  • Written in pure Python, with no external
    dependencies
  • Could also take a look at using tuple spaces;
    PyLinda provides a framework for this.

21
HarvestMan Plans for EIAO
  • EIAO version 1.0 (Proof of concept)
  • Write the persistency, scoping and scheduling
    extensions
  • Add any more plugins as needed
  • EIAO version 2.0
  • Use the distributed HarvestMan architecture to
    distribute crawling tasks more efficiently
    across multiple machines in a cluster
  • Allow distributed fetching of scoping rules from
    repositories
  • Should fit EIAO's performance requirements by
    this time.

22
HarvestMan Framework for creating web
accessibility applications
  • Pluggable design, hence very customizable
  • Can be used as a framework for developing
    applications with very specific processing
    capabilities on top of the basic web crawling
    functionality provided by HarvestMan
  • Suitable for university courses on web mining or
    web accessibility application development

23
Questions ?

24
  • THANK YOU!
  • Anand B Pillai
  • abpillai_at_gmail.com

25
References
  • HarvestMan web crawler
    http://harvestman.freezope.org
  • Pyro (Python Remote Objects)
    http://pyro.sourceforge.net