Title: Greenstone Digital Library
1Greenstone Digital Library
- http://www.greenstone.org/
- Speaker: Su-Hsien Huang
- E-mail: sshuang_at_cis.nctu.edu.tw
- Ext.: 56647
- Department of Computer and Information Science
2Contents
- Introduction
- Installation
- Greenstone Demo
- The Collector
3Introduction
- Greenstone
- produced by the New Zealand Digital Library
Project at the University of Waikato, and
developed and distributed in cooperation with
UNESCO and the Human Info NGO.
- open-source, multilingual software
- a suite of software for building and distributing
digital library collections.
- a comprehensive system for constructing and
presenting collections of thousands or millions
of documents, including text, images, audio and
video.
- Download: http://prdownloads.sourceforge.net/greenstone/gsdl-2.39-win32.exe
- Collection
- A set of documents with metadata
- a typical digital library built with Greenstone
will contain many collections
- search for particular words
- browse documents by subject
4Finding Information
- Full-text indexes
- documents can be searched for particular words,
combinations of words, or phrases, and results
are ordered
- Metadata
- descriptive data such as author, title, date,
keywords, and so on, is associated with each
document
5Document Formats
- Greenstone supports several kinds of document
formats
- Plain text
- HTML
- Word
- PDF
- Usenet
- E-mail messages
- Documents are converted into a standard XML form
for indexing
6Multimedia and Multilingual Documents
- Non-textual material is either linked into the
textual documents or accompanied by textual
descriptions
- This allows full-text searching and browsing.
- Documents are encoded in Unicode
- a standard scheme for representing the character
sets used in the world's languages
- Arabic, Chinese, English, French, Māori and
Spanish
7Installation
- Installation
- Uninstallation
8Version
9Directory
10Directory
11Directory
12Documentation
- Greenstone Digital Library Installer's Guide
- Greenstone Digital Library User's Guide
- Greenstone Digital Library Developer's Guide
- Greenstone Digital Library: From Paper to
Collection
13Greenstone Demo
14Icons at The Top
- This takes you to the about page
- This takes you to the Digital Library's home
page, from which you can select another
collection
- This provides help text similar to what you are
reading now
- This allows you to set some user interface and
searching options that will then be used
henceforth
15Icons on The Search/Browse Bar
- Search for particular words
- Access publications by subject
- Access publications by title
- Access publications by organization
- Access publications by how to listing
16A Book in The Demo Collection
17Browsing Icons
- Click on a book icon to read the corresponding
book
- Click on a bookshelf icon to look at books on
that subject
- View this document
- Open this folder and view contents
- Click on this icon to close the book
- Click on this icon to close the folder
- Click on the arrow to go on to the next section
... or back to the previous section
- Open this page in a new window
- Expand table of contents
- Display all text
- Highlight search terms
18Searching
- By default, twenty matching documents are shown.
- A maximum of 100 is imposed on the number of
documents returned.
- You can change these numbers by clicking the
preferences button at the top of the page
- Each search term contains nothing but alphabetic
characters and digits.
- Terms are separated by white space.
- Punctuation is ignored.
- For example
- Agro-forestry in the Pacific Islands: Systems
for Sustainability (1993)
- will be treated the same as
- Agro forestry in the Pacific Islands Systems
for Sustainability 1993
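The term-splitting rule above can be sketched in a few lines. This is only an illustration of the stated rule, not Greenstone's own code; the function name query_terms is hypothetical, and restricting terms to ASCII letters and digits is a simplifying assumption.

```python
import re


def query_terms(query: str) -> list[str]:
    """Split a query into terms made only of letters and digits.

    Assumption: ASCII letters and digits only; every other character
    (punctuation, hyphens, white space) acts as a separator.
    """
    return re.findall(r"[A-Za-z0-9]+", query)


with_punctuation = query_terms(
    "Agro-forestry in the Pacific Islands: Systems for Sustainability (1993)"
)
without_punctuation = query_terms(
    "Agro forestry in the Pacific Islands Systems for Sustainability 1993"
)
# Both spellings of the query yield the same list of terms.
same_terms = with_punctuation == without_punctuation
```

Under this rule the two example queries from the slide reduce to the same terms, which is why they are treated alike.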
19Searching
- Query type
- all the words.
- some of the words.
- Documents are displayed in order of how closely
they match the query.
- Scope of queries
- author or title
- Chapter or paragraph or whole document
20Advanced search features
- Set via the preferences button
- Case sensitivity and stemming
- "African building" will be treated the same as
"africa builds"
- Phrase searching
- post-retrieval scan?
- Advanced query mode
- logical operators & (and), | (or), and ! (not)
- AND, OR, and NOT are treated as ordinary search
terms
- Using search history
21Changing the preferences
- Collection preferences
- Language preferences
- Interface preference
- English, Chinese, ...
- Encoding
- UTF-8, Big5, ...
- Interface format
- Graphical, textual
22The Preference Page
23The Collector
- The Collector is a facility that helps you create
new collections, modify or add to existing ones,
or delete collections.
- create a new collection with the same structure
as an existing one
- create a new collection with a different
structure from existing ones
- add new material to an existing collection
- modify the structure of an existing collection
- delete a collection, and
- write an existing collection to a self-contained,
self-installing CD-ROM.
24The collection
- The structure of a particular collection is
determined when the collection is set up. This
includes
- the format of the source documents
- how they should be displayed on the screen
- the source of metadata
- what browsing facilities should be provided
- what full-text search indexes should be provided
- and how the search results should be displayed.
25Logging In
- The default login name is admin
- The password is set during installation
26Dialog structure
- After logging in, a sequence of steps is involved
in collection building
- Collection information
- specify the collection's name and associated
information
- Source data
- where the source data is to come from.
- Configuring the collection
- adjust the configuration options
- Building the collection
- makes all the indexes and gathers together any
other information that is required to make the
collection
- Viewing the collection.
27Collection Information
- Title
- Contact E-mail address
- Brief description.
28Collection Information
- The configuration file is
"GSDLHOME\collect\colname\colname.cfg"
- Attributes
- Name: collectionmeta collectionname
- Icon: collectionmeta iconcollection
- Description: collectionmeta collectionextra
29Source Data
- A collection may contain
- HTML documents (.htm, .html)
- plain text documents (.txt, .text)
- Microsoft Word documents (.doc)
- PDF documents (.pdf)
- E-mail documents (.email)
30Source Data
- There are three kinds of specification
- a directory name on the Greenstone server system
(beginning with file://)
- an address beginning with http:// for files to
be downloaded from the web
- an address beginning with ftp:// for files to
be downloaded using anonymous FTP.
- Sources might be unavailable because
- the file, FTP site or URL does not exist
- you need to dial up your ISP first
- you are trying to access a URL from behind a
firewall.
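As a rough illustration of how the three kinds of source specification can be told apart by their prefix, here is a minimal sketch using Python's standard URL parser. The function name source_kind and the returned labels are hypothetical and are not part of Greenstone.

```python
from urllib.parse import urlparse

# Labels are illustrative only; the three schemes come from the slide.
KINDS = {
    "file": "directory on the Greenstone server",
    "http": "files downloaded from the web",
    "ftp": "files downloaded using anonymous FTP",
}


def source_kind(spec: str) -> str:
    """Classify a source specification by its URL scheme."""
    scheme = urlparse(spec).scheme.lower()
    if scheme not in KINDS:
        raise ValueError(f"unsupported source specification: {spec!r}")
    return KINDS[scheme]
```

For example, source_kind("ftp://example.org/docs/") returns the anonymous FTP label, and anything without one of the three prefixes is rejected. The sketch only inspects the prefix; it does not check whether the source is actually reachable.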
31Working with Existing Collections
- To work with an existing collection, you first
select the collection from a list that is
provided.
- Some collections are write protected
- With the collection, you can
- Add more data and rebuild the collection
- Edit the collection configuration file
- Delete the collection entirely
- Export the collection to CD-ROM.
32Document Formats
- When building collections, Greenstone processes
each different format of source document by
seeking a plugin that can deal with that
particular format.
- Plugins are specified in the collection
configuration file.
- Greenstone generally uses the filename to
determine document formats
- foo.txt is processed as a text file
- foo.html as HTML
- foo.doc as a Word file.
33Document Formats
- TEXTPlug (.txt, .text)
- adds title metadata based on the first line of
the file.
- HTMLPlug (.htm, .html, .shtml, .shm, .asp, .php,
.cgi)
- extracts title metadata based on the <title>
tag
- other metadata expressed using HTML's meta tag
syntax can be extracted too.
- WORDPlug (.doc)
- uses independent programs to convert Word files
to HTML
- PDFPlug (.pdf)
- uses pdftohtml to convert PDF files to HTML.
- PSPlug (.ps)
- uses a standard Linux program, called ps2ascii, to
convert PostScript files
- EMAILPlug (.email)
- extracts Subject, To, From, and Date
metadata
- this plugin does not yet handle MIME-encoded
E-mails properly
- ZIPPlug (.gz, .z, .tgz, .taz, .bz, .zip, .tar)
- relies on the programs gunzip, bunzip, unzip,
and tar, which are standard Linux utilities
- ZIPPlug is disabled on Windows computers.
34Importing Process
- convert documents from their native format into
the Greenstone Archive Format used within
Greenstone
- write a summary file (called archives.inf) which
will be used when the collection is built.
36Configuring the Collection
37Collection Configuration File
38Collection Configuration File
- indexes: determines which collection indexes are
created
- Indexes can be constructed at the document,
section, and paragraph levels.
- Example
- collectionextra
- describes the collection.
- collectionmeta collectionextra "collection
description"
- collectionmeta collectionextra [l=fr]
"description in French"
- collectionmeta collectionextra [l=mi]
"description in Maori"
39Define Subcollection
- Example: a collection has three indexes
- the whole collection
- the Food and Nutrition Bulletin
- the remaining documents.
- fn, other: subcollection names
- i means case-insensitive matching
40Cross-Collection Searching
- Greenstone has a facility for cross-collection
searching
- Allows several collections to be searched at
once, with the results combined behind the scenes
as though you were searching a single unified
collection.
- supercollection col_1 col_2 ...
41Configuration File Example
42Building the Collection
43Plugins
- Plugins parse the imported documents and extract
metadata from them.
- For example, the HTML plugin converts HTML pages
to the Greenstone archive format and extracts
metadata such as titles, enclosed by
<title></title> tags.
- Written in the Perl language.
- All derive from a basic plugin called BasPlug.
- Plugins are kept in the perllib/plugins
directory.
- To find out more about any plugin, just type
perl -S pluginfo.pl plugin-name at the command prompt.
44Plugins
- A general plugin called ConvertToPlug invokes the
appropriate conversion program and passes the
result to either TEXTPlug or HTMLPlug.
- Adding a new external document conversion utility
- Install the new conversion utility so that it is
accessible by Greenstone (put it in the packages
directory).
- Alter gsConvert.pl to use the new conversion
utility. This involves adding a new clause to the
if statement in the main function, and adding a
function that calls the conversion utility.
- Write a top-level plugin that inherits from
ConvertToPlug to catch the format and pass it on.
45Plugins Operations
46Plugins of Greenstone
47Plugins of Greenstone
48Plugins-specific Operation
49Plugins-specific Operation
50Collection Specific
51Adding Metadata to Documents
- The standard plugin RecPlug also incorporates a
way of assigning metadata to documents from XML
files.
- When its use_metadata_files option is set,
RecPlug checks each input directory for an XML
file called metadata.xml and applies its contents
to all the directory's files and subdirectories.
52Adding Metadata to Documents
53Tagging document files
- The HTML plugin has a description_tags option
54Classifier
- Classifiers are used to create a collection's
browsing indexes.
- Examples are the collection's Titles A-Z index,
and the Subject, How to, and Organisation indexes
- Any document in the collection that does not have
the classifier's metadata defined will be omitted
from the classifier (but it is still indexed, and
consequently searchable).
- Collection-specific classifiers can be written,
and are stored in the collection's
perllib/classify directory.
55Hierarchical Classifier
- Example: classify Hierarchy hfile sub.txt
metadata Subject sort Title
- The hierarchy for classification is stored in a
simple text format in sub.txt.
56Greenstone Classifier
- Each classifier receives an implicit name from
its position in the configuration file
- For example, the third classifier specified in
the file is called CL3.
- Collection-specific classifiers can be written,
and are stored in the collection's
perllib/classify directory.
57Hierarchical Classifier
- The hfile defines the metadata hierarchy; each
line has three parts
- Identifier, which matches the value of the
metadata (given by the metadata argument) to the
classification.
- Position-in-hierarchy marker, in multi-part
numeric form, e.g. 2, 2.12, 2.12.6.
- The name of the classification. (If this contains
spaces, it should be placed in quotation marks.)
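To make the three-part line format concrete, here is a minimal parsing sketch. The sample entries are invented for illustration, and the function parse_hfile is hypothetical; it assumes the three parts are separated by white space and that quoting follows shell-like rules.

```python
import shlex


def parse_hfile(text: str) -> list[tuple[str, tuple[int, ...], str]]:
    """Parse hierarchy lines into (identifier, position, name) tuples."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # shlex honours quotation marks, so quoted names may contain spaces.
        identifier, position, name = shlex.split(line)
        marker = tuple(int(part) for part in position.split("."))
        entries.append((identifier, marker, name))
    return entries


# Invented sample data, not taken from any real sub.txt.
SAMPLE = 'farming 2 Farming\ncrops 2.12 "Cash crops"\ncoffee 2.12.6 Coffee\n'
parsed = parse_hfile(SAMPLE)
```

Storing the marker as a tuple of integers makes it easy to see that (2, 12, 6) sits below (2, 12), which in turn sits below (2,).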
58How Classifier Works
- Classifiers are Perl objects, derived from
BasClas.pm
- The new method creates the classifier object.
- The init method initialises the object with
parameters such as metadata type, button name and
sort criterion
- The classify method is invoked once for each
document, and stores information about the
classification made within the classifier object.
- The get_classify_info method returns the locally
stored classification information to the build
process, which it then writes to the collection
information database for use when the collection
is displayed at runtime.
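The life cycle described above (create, initialise, classify each document, then hand back the result) can be mimicked in a small Python class. This is an analogy only: real Greenstone classifiers are Perl objects derived from BasClas.pm, and the class name TitleClassifier and its data layout are hypothetical.

```python
class TitleClassifier:
    """Toy stand-in for a classifier that lists documents by one metadata field."""

    def __init__(self, metadata: str = "Title") -> None:
        # Plays the role of new + init: remember which metadata to use.
        self.metadata = metadata
        self.entries: list[tuple[str, str]] = []

    def classify(self, oid: str, doc_metadata: dict[str, str]) -> None:
        """Called once per document; stores the classification locally."""
        value = doc_metadata.get(self.metadata)
        if value is None:
            return  # documents without the metadata are omitted
        self.entries.append((value, oid))

    def get_classify_info(self) -> list[str]:
        """Return document identifiers ordered by the metadata value."""
        return [oid for _, oid in sorted(self.entries)]
```

A build process in this picture would call classify for every document and then call get_classify_info once to obtain the ordered list it stores for display.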
59Formatting Greenstone Output
- The web pages are generated on the fly as they
are needed.
- The appearance of many aspects of the pages is
controlled using format strings.
- Format strings belong in the collection
configuration file, introduced by the keyword
format followed by the name of the element to
which the format applies.
- Two different kinds of page element are
controlled by format strings.
- the items on the page that show documents or
parts of documents.
- the lists produced by classifiers and searches.
60Formatting Greenstone lists
- Classifiers CL1, CL2, CL3, ... (for AZList,
DateList, etc.)
- collect.cfg specifies the list formatting to display
61Format Operations
62Formatting Example
63Formatting Example
64Controlling the Greenstone User Interface
- The entire Greenstone user interface is
controlled by macros which reside in the
GSDLHOME/macros directory.
- All macro files used by Greenstone are listed in
GSDLHOME/etc/main.cfg
- Macro files have a .dm extension.
- base.dm defines the basic content of a page.
- Each file defines one or more packages, each
containing a series of macros
65Macro Syntax
- Syntax
- _content_ {<p><h2>Oops</h2>_textdefaultcontent_}
- Macros often contain conditional statements.
- _If_(x,y,z), where x is a condition, y is the
macro content to use if that condition is true,
and z the content if it is false.
66About Page Content Example
67Building Process
- the text is compressed
- the full-text indexes that are specified in the
collection configuration file are created.
- information about how the collection is to appear
on the web is precalculated and incorporated into
the collection
- for example information about icons and titles,
and information produced by classifiers.
- All these steps are handled by mgbuilder (or the
collection-specific builder), which in turn uses
the MG (Managing Gigabytes, see Witten et al.,
1999) software for compressing and indexing.
68Building Process
- To make a collection available over the web once
it is built, you must move it from the
collection's building directory to the index
directory.
- Collections are not built directly into index
because large collections may take hours or days
to build.
- It is important that the building process does
not affect an existing copy of the collection
until the build is complete.
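The move from the building directory to the index directory can be sketched as follows. This is a minimal illustration of the idea, not a Greenstone tool: the function install_build is hypothetical, and it assumes both directories are named building and index and sit directly inside the collection's directory.

```python
import shutil
from pathlib import Path


def install_build(collection_dir: Path) -> None:
    """Replace the live index directory with a finished build."""
    building = collection_dir / "building"
    index = collection_dir / "index"
    if not building.is_dir():
        raise FileNotFoundError(f"no finished build in {building}")
    # Set the old copy aside first so the live index is only
    # touched once the new build is known to exist.
    old = collection_dir / "index.old"
    if index.exists():
        index.rename(old)
    building.rename(index)
    if old.exists():
        shutil.rmtree(old)
```

Because the swap happens only after the build has finished, a long-running build never disturbs the copy of the collection that users are currently searching.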
70Viewing the Collection
- There is a facility for E-mail to be sent to the
collection's contact E-mail address, and to the
system's administrator
- The facility is disabled by default but can be
enabled by editing the main.cfg configuration
file
71Greenstone Archive Format
- All source documents are brought into the
Greenstone system by converting them to a format
known as the Greenstone Archive Format.
- The <Section> tag denotes the start of each
document section, and the corresponding
</Section> closing tag marks the end of that
section
- Following each <Section> tag is a <Description>
section
- Within this come any number of <Metadata>
elements.
- The Dublin Core metadata standard is used for
defining metadata types
72Greenstone Archive Format
<?xml version="1.0" ?>
<!DOCTYPE GreenstoneArchive SYSTEM
  "http://greenstone.org/dtd/GreenstoneArchive/1.0/GreenstoneArchive.dtd">
<Section>
  <Description>
    <Metadata name="gsdlsourcefilename">ec158e.txt</Metadata>
    <Metadata name="Title">Freshwater Resources in Arid Lands</Metadata>
    <Metadata name="Identifier">HASH0158f56086efffe592636058</Metadata>
    <Metadata name="gsdlassocfile">cover.jpg:image/jpeg:</Metadata>
    <Metadata name="gsdlassocfile">p07a.png:image/png:</Metadata>
  </Description>
  <Section>
    <Description>
      <Metadata name="Title">Preface</Metadata>
    </Description>
    <Content>
      This is the text of the preface
    </Content>
  </Section>
  <Section>
    <Description>
      <Metadata name="Title">First and only chapter</Metadata>
    </Description>
    <Section>
      <Description>
        <Metadata name="Title">Part 1</Metadata>
      </Description>
      <Content>
        This is the first part of the first and only chapter
      </Content>
    </Section>
    <Section>
      <Description>
        <Metadata name="Title">Part 2</Metadata>
      </Description>
      <Content>
        This is the second part of the first and only chapter
      </Content>
    </Section>
  </Section>
</Section>
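To show how the nested sections of an archive document can be read back, here is a minimal sketch using Python's standard XML parser. The sample is a shortened, hand-written fragment in the same shape as the slide's example (the DOCTYPE line is left out), and the function section_titles is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened sample in the shape of the archive example; not a real archive file.
SAMPLE = """<Section>
  <Description>
    <Metadata name="Title">Freshwater Resources in Arid Lands</Metadata>
  </Description>
  <Section>
    <Description><Metadata name="Title">Preface</Metadata></Description>
    <Content>This is the text of the preface</Content>
  </Section>
</Section>"""


def section_titles(section: ET.Element, depth: int = 0) -> list[tuple[int, str]]:
    """Collect (nesting depth, title) pairs from a Section and its subsections."""
    titles = []
    for meta in section.findall("Description/Metadata"):
        if meta.get("name") == "Title":
            titles.append((depth, meta.text))
    for child in section.findall("Section"):
        titles.extend(section_titles(child, depth + 1))
    return titles


titles = section_titles(ET.fromstring(SAMPLE))
```

The recursion mirrors the format itself: each Section carries its own Description and may contain further Section elements.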
73Dublin Core Metadata
74Inside Greenstone Archive Documents
- Documents are divided into paragraphs.
- They can be split hierarchically into sections
and subsections
- Each document has an associated Object Identifier
or OID
- For example, subsection 3 of section 2 of
document HASHa7 is referred to as HASHa7.2.3.
75Hierarchy Structure in Greenstone Documents
76Administration
- The administrative facility also presents
configuration information about the installation
and allows it to be modified.
77Collection Info
78Configuration files
- site configuration file: gsdlsite.cfg
- the name of the directory where the Greenstone
software is kept
- the HTTP address of the Greenstone system
- whether the fastcgi facility is being used.
- the main configuration file: main.cfg
- common to the interface of all collections served
by a Greenstone site.
- E-mail address of the system maintainer
- whether the status and collector pages are
enabled
- whether logs of user activity are kept
- whether Internet cookies are used to identify
users.
79Logs
- usage logs
- error logs
- initialization logs.
- Logging is disabled by default
- it is enabled by including these lines in the main
system configuration file
- logcgiargs true
- turns logging on and off.
- usecookies true
- assigned to each user, which enables
80User Management
- List users
- Add a new user
- Change password
- Groups
- Administrator
- add and remove users, and change their groups
- Colbuilder
- access the facilities described above to build
new collections and alter (and delete) existing
ones
- Default user
- admin
81Technical Information
- General
- gives access to technical information, including
the directories where things are stored
- Protocols
- gives, for each possible protocol type,
information about each of the collections
supported by that protocol.
- Actions
- used by the user interface code (called the
receptionist) to communicate the wishes of the
user
- These actions correspond to the CGI argument
labeled a
- For example, if a=status the receptionist invokes
the status action (which displays the status
page).
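The link between the CGI argument a and an action can be illustrated with a small dispatch table. This is a sketch of the idea only: the handler show_status, the ACTIONS table, and the function dispatch are hypothetical and do not reproduce the receptionist's actual code.

```python
from urllib.parse import parse_qs


def show_status() -> str:
    """Hypothetical handler standing in for the status action."""
    return "status page"


# Maps values of the CGI argument "a" to handlers (illustrative only).
ACTIONS = {"status": show_status}


def dispatch(query_string: str) -> str:
    """Pick the action named by the 'a' argument and run it."""
    args = parse_qs(query_string)
    action_name = args.get("a", [""])[0]
    handler = ACTIONS.get(action_name)
    if handler is None:
        raise KeyError(f"unknown action: {action_name!r}")
    return handler()
```

Here dispatch("a=status") runs the status handler, while a query string naming an action that is not in the table is rejected.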