Title: Presentation Francis Cave ACAP
1 Communicating with crawlers: What ACAP has to offer
Francis Cave, EDItEUR
ACAP Technical Project Manager
May 2008
WEBCONTENT Te Mooi Om Weg Te Geven
NUV, Amsterdam
2 Communicating with crawlers: What ACAP has to offer
- What is ACAP (Version 1.0)?
- What has been the experience so far?
- What publishers should do now...
3 The ACAP Technical Framework: ACAP Version 1.0
- What it is
- a toolkit to enable communication of content access and usage policies
- adopting and building upon existing standards
- rooted in the requirements of real use cases
- a proof of concept
- What it isn't
- at this stage ACAP is not a formal standard
- a technical enforcement mechanism
4 The ACAP Technical Framework: ACAP Version 1.0
- What is it?
- Protocols for machine-to-machine messaging
- using a common vocabulary of access and usage terminology
- Guidance on methods of communication and access control
- Software tools to support implementation
5 The ACAP Technical Framework: ACAP Version 1.0
- What kinds of protocols?
- business layer protocols
- machines already know how to talk to one another
- physical layer: PPP, ATM, ...
- network layer: TCP/IP
- application layer: HTTP, HTTPS, SMTP, FTP
- business layer: RSS, ebXML, EDIINT, SOAP, web services, ...
- they just don't know what to say in the business of communicating access and usage policies
6 The ACAP Technical Framework: ACAP Version 1.0
- We need to tell the machines what to say to one another
- we need a common vocabulary
- so they know what to say
- and how to interpret it
- and tell them how to say it
- using whatever protocols they already use to talk to one another
7 The ACAP Technical Framework: ACAP Version 1.0
- But machines aren't going to do this on their own
- we need to provide guidance on how to implement the protocols
- we need to provide tools to support implementation
8 The ACAP Technical Framework: ACAP Version 1.0
- How has it been developed?
- We started with a set of real business use cases
- Nine publishers looking for ways of communicating access and use policies for their online content
- A national archive looking for ways of finding out what they were allowed to do with the content that they are preserving for posterity
- A search engine looking for ways to include more high-quality content in their index
9 The ACAP Technical Framework: ACAP Version 1.0
- What does ACAP Version 1.0 include?
- Extensions to the Robots Exclusion Protocol (REP)
- Part 1 specifies extensions to the robots.txt format
- enables policies to be expressed for an entire website
- leverages the established protocol for web server-crawler communication
- the existing format is used on millions of websites and understood by hundreds of crawlers
- Part 2 specifies extensions to the Robots META Tags format
- enables policies to be expressed within individual HTML pages
- existing format understood by major search engines
- Dictionary of access and usage terminology
- robots.txt conversion tool
10 The ACAP Technical Framework: ACAP Version 1.0
- Why does REP need to be extended?
- conventional REP has only a very limited vocabulary
- even if we include non-standard extensions that not every search engine has implemented
- conventional REP is inconsistently interpreted
- e.g. Disallow means different things to different crawlers
- don't crawl?
- don't index?
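To illustrate how narrow the conventional vocabulary is, here is a minimal Python sketch using the standard library's urllib.robotparser. The only question a conventional robots.txt lets this parser answer is "may this crawler fetch this URL?"; nothing in the file says whether the page may be indexed or used in any other way. The robots.txt content, the crawler name ExampleBot and the URLs are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /archive/
"""


def may_fetch(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if the conventional rules allow `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in (
        "http://www.example.com/archive/2008/story.html",
        "http://www.example.com/news/today.html",
    ):
        print(url, may_fetch(ROBOTS_TXT, "ExampleBot", url))
```

The result is a single yes/no per URL, which is why a crawler has to guess whether Disallow was meant as "don't crawl" or "don't index".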
11 The ACAP Technical Framework: ACAP Version 1.0
- ACAP Version 1.0 has been tested by four publishers against their priority use cases
- De Persgroep: major Flemish news publisher
- Media 24: global news / media publisher based in South Africa
- Macmillan: online book content hosting service
- Reed Elsevier: scientific and business information publisher
- all the tested use cases concern text resources
- current technical work includes extension of ACAP Version 1.0 to enable communication of policies relating specifically to non-text resources such as images and video
- ACAP Version 1.0 has been implemented in a test crawler by search engine operator Exalead
12 The ACAP Technical Framework: ACAP Version 1.0
- Tool for converting existing robots.txt files
- converts conventional robots.txt files so that existing policies are expressed using ACAP terminology
- User-agent -> ACAP-crawler
- Disallow -> ACAP-disallow-crawl
- Allow -> ACAP-allow-crawl
- is implemented in Perl
- can be used from the ACAP website
- http://www.the-acap.org/convert-robots-txt-to-acap.php
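The field-name mapping on this slide can be sketched in a few lines of Python. This is not the ACAP conversion tool itself (which is written in Perl and may behave differently, for example in how it treats the original lines); it only applies the three renamings listed above. The function name convert_robots_txt and the choice to pass all other lines through unchanged are assumptions made for this sketch.

```python
# Field-name mapping taken from the slide; everything else is an assumption.
FIELD_MAP = {
    "user-agent": "ACAP-crawler",
    "disallow": "ACAP-disallow-crawl",
    "allow": "ACAP-allow-crawl",
}


def convert_robots_txt(text: str) -> str:
    """Rename conventional REP fields to their ACAP equivalents."""
    out = []
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        new_name = FIELD_MAP.get(name.strip().lower())
        if sep and new_name:
            out.append(f"{new_name}: {value.strip()}".rstrip())
        else:
            # Comments, blank lines and unknown fields pass through unchanged.
            out.append(line)
    return "\n".join(out)


if __name__ == "__main__":
    conventional = "User-agent: *\nDisallow: /private/\nAllow: /public/"
    print(convert_robots_txt(conventional))
```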
13 The ACAP Technical Framework: ACAP Version 1.0
- Guidance on crawler authentication
- How to identify crawler names and IP addresses by analysing web server access log files
- How to configure a server so that you can deliver different robots.txt files to different crawlers
- examples are based upon the Apache web server
- ACAP Version 1.0 Implementation Guide
- Step-by-step guide on how to make full use of the extensions to REP proposed in ACAP Version 1.0
- Illustrated with many examples
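The log-analysis step can be sketched as follows. This is not taken from the ACAP guidance; it is a minimal Python example that assumes the access log uses Apache's "combined" format and that any client requesting /robots.txt is a likely crawler (a heuristic chosen for this sketch). The sample log lines, IP addresses and the name ExampleBot are invented.

```python
import re
from collections import defaultdict

# Apache "combined" log format: host, identity, user, [time], "request",
# status, bytes, "referer", "user-agent".
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "(?P<request>[^"]*)" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

# Invented sample lines for illustration.
SAMPLE_LOG = [
    '192.0.2.10 - - [12/May/2008:10:00:00 +0200] "GET /robots.txt HTTP/1.1" 200 120 "-" "ExampleBot/1.0"',
    '198.51.100.7 - - [12/May/2008:10:00:05 +0200] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    '192.0.2.11 - - [12/May/2008:10:01:00 +0200] "GET /robots.txt HTTP/1.1" 200 120 "-" "ExampleBot/1.0"',
]


def crawlers_fetching_robots(lines):
    """Map each user-agent that requested /robots.txt to the IPs it used."""
    seen = defaultdict(set)
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        parts = match.group("request").split()
        if len(parts) >= 2 and parts[1] == "/robots.txt":
            seen[match.group("agent")].add(match.group("ip"))
    return dict(seen)


if __name__ == "__main__":
    for agent, ips in crawlers_fetching_robots(SAMPLE_LOG).items():
        print(agent, sorted(ips))
```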
14 The ACAP Technical Framework: ACAP Version 1.0
- Review of test results
- We have tested ACAP Version 1.0 REP extensions in a range of use cases
- for most of the tested use cases there are no unresolved issues
- but protected content use cases
- have been particularly challenging to implement
- have highlighted need for further work on some terminology
- ACAP Version 1.0 is ready to implement
- for use cases in unprotected online content delivery
- for some use cases in protected online content delivery
- but ACAP needs further development
- all specifications will continue to be revised and extended
15 The ACAP Technical Framework: Future plans
- To be added in future
- corrections and clarifications of a few points in ACAP Version 1.0
- additional vocabulary required for expressing policies specific to
- the creation and use of web archives
- the presentation of images and other media content
- the communication of policies associated with page fragments
- mechanisms for embedding ACAP policies in PDF and media resources
- an XML format for policy expression
- based upon ONIX for Licensing Terms developed by EDItEUR
- required for news and web syndication use cases
16 The ACAP Technical Framework: ACAP Version 1.0
- Experience to date
- ACAP Version 1.0 works
- it enables a richer form of expression of policies than is possible using conventional REP ...
- it doesn't interfere with current crawler activity ...
- ... but it only goes so far.
- ACAP Version 1.0 needs to be extended
- ACAP Version 1.1 (June/July 2008)
17 The ACAP Technical Framework: ACAP Version 1.0
- What should publishers do now?
- ACAP Version 1.0 needs to be implemented!
- use the conversion tool to convert existing robots.txt files to use ACAP forms of expression
- use the Implementation Guide to refine policy expressions
- consider creating crawler-specific policies in separate robots.txt files
- give us your feedback, to help us improve future versions of ACAP
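The idea of crawler-specific policies in separate robots.txt files comes down to choosing a file based on who is asking. The ACAP guidance bases its examples on Apache configuration; the Python sketch below only shows the selection logic. The file names, the crawler names and the function robots_file_for are hypothetical.

```python
# Hypothetical mapping from a crawler-name fragment to a policy file.
CRAWLER_FILES = {
    "examplebot": "robots-examplebot.txt",
    "otherbot": "robots-otherbot.txt",
}
DEFAULT_FILE = "robots.txt"


def robots_file_for(user_agent: str) -> str:
    """Pick the robots.txt variant to serve for a given User-Agent header."""
    ua = user_agent.lower()
    for fragment, filename in CRAWLER_FILES.items():
        if fragment in ua:
            return filename
    # Unknown clients get the general-purpose file.
    return DEFAULT_FILE


if __name__ == "__main__":
    print(robots_file_for("Mozilla/5.0 (compatible; ExampleBot/1.0)"))
    print(robots_file_for("SomeBrowser/3.0"))
```

Matching on the User-Agent header alone trusts whatever name the client reports, so in practice this would be combined with a check on the crawler's IP address.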
18 The ACAP Technical Framework
Thank you! Questions? francis_at_franciscave.com