Greenstone Digital Library - PowerPoint PPT Presentation


1
Greenstone Digital Library
  • http://www.greenstone.org/

Speaker: Su-Hsien Huang
E-mail: sshuang_at_cis.nctu.edu.tw
Ext: 56647
Department of Computer and Information Science
2
Contents
  • Introduction
  • Installation
  • Greenstone Demo
  • The Collector

3
Introduction
  • Greenstone
  • produced by the New Zealand Digital Library
    Project at the University of Waikato, and
    developed and distributed in cooperation with
    UNESCO and the Human Info NGO.
  • open-source, multilingual software
  • a suite of software for building and distributing
    digital library collections.
  • a comprehensive system for constructing and
    presenting collections of thousands or millions
    of documents, including text, images, audio and
    video.
  • Download: http://prdownloads.sourceforge.net/greenstone/gsdl-2.39-win32.exe
  • Collection
  • A set of documents with metadata
  • a typical digital library built with Greenstone
    will contain many collections
  • search for particular words
  • browse documents by subject

4
Finding Information
  • Full-text indexes
  • searched for particular words, combinations of
    words, or phrases, and results are ordered
  • Metadata
  • descriptive data such as author, title, date,
    keywords, and so on, is associated with each
    document

5
Document Formats
  • Greenstone supports several kinds of document
    formats
  • Plain text
  • HTML
  • Word
  • PDF
  • Usenet
  • E-mail messages.
  • Converted into a standard XML form for indexing

6
Multimedia and Multilingual Documents
  • Non-textual material is either linked into the
    textual documents or accompanied by textual
    descriptions
  • To allow full-text searching and browsing.
  • Encoded in Unicode
  • Unicode, a standard scheme for representing the
    character sets used in the world's languages, is
    used
  • Arabic, Chinese, English, French, Māori and
    Spanish

7
Installation
  • Installation
  • Uninstallation

8
Version
9
Directory
10
Directory
11
Directory
12
Documentation
  • Greenstone Digital Library Installer's Guide
  • Greenstone Digital Library User's Guide
  • Greenstone Digital Library Developer's Guide
  • Greenstone Digital Library: From Paper to
    Collection

13
Greenstone Demo
14
Icons at The Top
  • This takes you to the about page
  • This takes you to the Digital Library's home
    page, from which you can select another
    collection
  • This provides help text similar to what you are
    reading now
  • This allows you to set some user interface and
    searching options that will then be used
    henceforth

15
Icons on The Search/Browse Bar
  • Search for particular words
  • Access publications by subject
  • Access publications by title
  • Access publications by organization
  • Access publications by "how to" listing

16
A Book in the Demo Collection
17
Browsing Icons
  • Click on a book icon to read the corresponding
    book
  • Click on a bookshelf icon to look at books on
    that subject
  • View this document
  • Open this folder and view contents
  • Click on this icon to close the book
  • Click on this icon to close the folder
  • Click on the arrow to go on to the next section
    ...... or back to the previous section
  • Open this page in a new window
  • Expand table of contents
  • Display all text
  • Highlight search terms

18
Searching
  • Twenty matching documents will be shown.
  • A maximum of 100 is imposed on the number of
    documents returned.
  • You can change these numbers by clicking the
    preferences button at the top of the page
  • Each search term contains nothing but alphabetic
    characters and digits.
  • Terms are separated by white space.
  • Punctuation is ignored.
  • For example,
  • Agro-forestry in the Pacific Islands: Systems
    for Sustainability (1993)
  • will be treated the same as
  • Agro forestry in the Pacific Islands Systems
    for Sustainability 1993
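
The term rules above can be sketched in a few lines of Python. This is only an illustration of the stated behaviour (terms are runs of letters and digits, everything else separates them), not Greenstone's own query parser; the function name query_terms is made up for this example.

```python
import re

def query_terms(query: str) -> list[str]:
    """Split a query into terms made only of letters and digits.

    Punctuation and white space both act as separators, so hyphens,
    colons and parentheses simply disappear from the result.
    """
    # [^\W_]+ matches runs of word characters, excluding the underscore.
    return re.findall(r"[^\W_]+", query)

with_punctuation = query_terms(
    "Agro-forestry in the Pacific Islands: Systems for Sustainability (1993)"
)
without_punctuation = query_terms(
    "Agro forestry in the Pacific Islands Systems for Sustainability 1993"
)
print(with_punctuation == without_punctuation)  # True
```

Case folding and stemming are separate preferences and are deliberately left out of this sketch.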

19
Searching
  • Query type
  • all the words.
  • some of the words.
  • Documents are displayed in order of how closely
    they match the query.
  • Scope of queries
  • author or title
  • Chapter or paragraph or whole document

20
Advanced search features
  • Set using the preferences button
  • Case sensitivity and stemming
  • African building will be treated the same as
    africa builds
  • Phrase searching
  • post-retrieval scan?
  • Advanced query mode
  • logical operators & (and), | (or), and ! (not)
  • the words AND, OR, and NOT are treated as
    ordinary search terms
  • Using search history

21
Changing the preferences
  • Collection preferences
  • Language preferences
  • Interface language
  • English, Chinese, ...
  • Encoding
  • UTF-8, Big5, ...
  • Interface format
  • Graphical, textual

22
The Preference Page
23
The Collector
  • The Collector is a facility that helps you create
    new collections, modify or add to existing ones,
    or delete collections.
  • create a new collection with the same structure
    as an existing one
  • create a new collection with a different
    structure from existing ones
  • add new material to an existing collection
  • modify the structure of an existing collection
  • delete a collection, and
  • write an existing collection to a self-contained,
    self-installing CD-ROM.

24
The collection
  • The structure of a particular collection is
    determined when the collection is set up. This
    includes
  • the format of the source documents
  • how they should be displayed on the screen
  • the source of metadata
  • what browsing facilities should be provided
  • what full-text search indexes should be provided
  • and how the search results should be displayed.

25
Logging In
  • The default login name is admin
  • The password is set during installation

26
Dialog structure
  • After logging in, a sequence of steps is involved
    in collection building
  • Collection information
  • specify the collection's name and associated
    information
  • Source data
  • where the source data is to come from.
  • Configuring the collection
  • adjust the configuration options
  • Building the collection
  • makes all the indexes and gathers together any
    other information that is required to make the
    collection
  • Viewing the collection.

27
Collection Information
  • Title
  • Contact E-mail address
  • Brief description.

28
Collection Information
  • The configuration file is
    "GSDLHOME\collect\colname\colname.cfg"
  • Attributes
  • Name: collectionmeta collectionname
  • Icon: collectionmeta iconcollection
  • Description: collectionmeta collectionextra

29
Source Data
  • Collection may contain
  • HTML documents (.htm, .html)
  • plain text documents (.txt, .text)
  • Microsoft Word documents (.doc),
  • PDF documents (.pdf)
  • E-mail documents (.email).

30
Source Data
  • There are three kinds of specification
  • a directory name on the Greenstone server system
    (beginning with file://)
  • an address beginning with http:// for files to
    be downloaded from the web
  • an address beginning with ftp:// for files to
    be downloaded using anonymous FTP.
  • Sources might be unavailable because
  • the file, FTP site or URL does not exist
  • you need to dial up your ISP first
  • you are trying to access a URL from behind a
    firewall.
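
The three kinds of source specification can be told apart by their scheme. The following sketch uses Python's standard urllib.parse for this; the function name source_kind and the example addresses are invented for illustration, and this is not how the Collector itself is implemented.

```python
from urllib.parse import urlparse

# The three kinds of source specification listed on the slide.
SOURCE_KINDS = {
    "file": "directory on the Greenstone server",
    "http": "files downloaded from the web",
    "ftp": "files downloaded using anonymous FTP",
}

def source_kind(spec: str) -> str:
    """Classify a source specification by its scheme (file, http or ftp)."""
    scheme = urlparse(spec).scheme.lower()
    if scheme not in SOURCE_KINDS:
        raise ValueError(f"unsupported source specification: {spec!r}")
    return SOURCE_KINDS[scheme]

print(source_kind("file:///export/docs"))       # directory on the Greenstone server
print(source_kind("http://example.org/docs/"))  # files downloaded from the web
```

A check like this only validates the form of the address; whether the source is actually reachable (missing file, no dial-up connection, firewall) can only be discovered by trying it.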

31
Working with Existing Collections
  • To work with an existing collection, you first
    select the collection from a list that is
    provided.
  • Some collections are write protected
  • With the collection, you can
  • Add more data and rebuild the collection
  • Edit the collection configuration file
  • Delete the collection entirely
  • Export the collection to CD-ROM.

32
Document Formats
  • When building collections, Greenstone processes
    each different format of source document by
    seeking a plugin that can deal with that
    particular format.
  • Plugins are specified in the collection
    configuration file.
  • Greenstone generally uses the filename to
    determine document formats
  • foo.txt is processed as a text file
  • foo.html as HTML
  • foo.doc as a Word file.

33
Document Formats
  • TEXTPlug (.txt, .text)
  • adds title metadata based on the first line of
    the file.
  • HTMLPlug (.htm, .html, .shtml, .shm, .asp, .php,
    .cgi)
  • It extracts title metadata based on the <title>
    tag
  • other metadata expressed using HTML's meta tag
    syntax can be extracted too.
  • WORDPlug (.doc)
  • uses independent programs to convert Word files
    to HTML
  • PDFPlug (.pdf)
  • Uses pdftohtml to convert PDF files to HTML.
  • PSPlug (.ps)
  • Uses a standard Linux program called ps2ascii to
    convert PostScript files to text
  • EMAILPlug (.email)
  • The plugin extracts Subject, To, From, and Date
    metadata
  • this plugin does not yet handle MIME-encoded
    E-mails properly
  • ZIPPlug (.gz, .z, .tgz, .taz, .bz, .zip, .tar)
  • It relies on the programs gunzip, bunzip, unzip,
    and tar, which are standard Linux utilities
  • ZIPPlug is disabled on Windows computers.

34
Importing Process
  • convert documents from their native format into
    the Greenstone Archive Format used within
    Greenstone
  • write a summary file (called archives.inf) which
    will be used when the collection is built.

35
(No Transcript)
36
Configuring the Collection
37
Collection Configuration File
38
Collection Configuration File
  • indexes: determines which collection indexes are
    created
  • Indexes can be constructed at the document,
    section, and paragraph levels.
  • Example: collectionextra
  • describes the collection.
  • collectionmeta collectionextra "collection
    description"
  • collectionmeta collectionextra [l=fr]
    "description in French"
  • collectionmeta collectionextra [l=mi]
    "description in Maori"

39
Define Subcollection
  • Example: a collection has three indexes
  • the whole collection
  • the Food and Nutrition Bulletin
  • the remaining documents.
  • fn, other: subcollection names
  • i means case-insensitive

40
Cross-Collection Searching
  • Greenstone has a facility for cross-collection
    searching
  • Allows several collections to be searched at
    once, with the results combined behind the scenes
    as though you were searching a single unified
    collection.
  • supercollection col_1 col_2 ...

41
Configuration File Example
42
Building the Collection
43
Plugins
  • Plugins parse the imported documents and extract
    metadata from them.
  • For example, the HTML plugin converts HTML pages
    to the Greenstone archive format and extracts
    metadata such as titles, enclosed by
    <title></title> tags.
  • Written in the Perl language.
  • All derive from a basic plugin called BasPlug.
  • Plugins are kept in the perllib/plugins
    directory.
  • To find more about any plugin, just type perl -S
    pluginfo.pl plugin-name at the command prompt.

44
Plugins
  • A general plugin called ConvertToPlug invokes the
    appropriate conversion program and passes the
    result to either TEXTPlug or HTMLPlug.
  • Adding a new external document conversion utility
  • Install the new conversion utility so that it is
    accessible by Greenstone (put it in the packages
    directory).
  • Alter gsConvert.pl to use the new conversion
    utility. This involves adding a new clause to the
    if statement in the main function, and adding a
    function that calls the conversion utility.
  • Write a top-level plugin that inherits from
    ConvertToPlug to catch the format and pass it on.

45
Plugins Operations
46
Plugins of Greenstone
47
Plugins of Greenstone
48
Plugins-specific Operation
49
Plugins-specific Operation
50
Collection Specific
51
Adding Metadata to Documents
  • The standard plugin RecPlug also incorporates a
    way of assigning metadata to documents from XML
    files.
  • When its use_metadata_files option is set,
    RecPlug checks each input directory for an XML
    file called metadata.xml and applies its contents
    to all the directory's files and subdirectories.

52
Adding Metadata to Documents
53
Tagging document files
  • The HTML plugin has a description_tags option

54
Classifier
  • Classifiers are used to create a collection's
    browsing indexes.
  • Examples are the collection's Titles A-Z index,
    and the Subject, How to, and Organisation indexes
  • Any document in the collection that does not have
    this metadata defined will be omitted from the
    classifier (but it is still indexed, and
    consequently searchable).
  • Collection-specific classifiers can be written,
    and are stored in the collection's
    perllib/classify directory.

55
Hierarchical Classifier
  • Example: classify Hierarchy -hfile sub.txt
    -metadata Subject -sort Title
  • The hierarchy for classification is stored in a
    simple text format in sub.txt.

56
Greenstone Classifier
  • Each classifier receives an implicit name from
    its position in the configuration file
  • For example, the third classifier specified in
    the file is called CL3.
  • Collection-specific classifiers can be written,
    and are stored in the collection's
    perllib/classify directory.

57
Hierarchical Classifier
  • The hfile defines the metadata hierarchy
  • Identifier, which matches the value of the
    metadata (given by the metadata argument) to the
    classification.
  • Position-in-hierarchy marker, in multi-part
    numeric form, e.g. 2, 2.12, 2.12.6.
  • The name of the classification. (If this contains
    spaces, it should be placed in quotation marks.)
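
A minimal Python sketch of reading such a hierarchy file follows. It assumes one entry per line with the three fields in the order listed above, separated by white space; the sample entries and all names in the code are hypothetical.

```python
import shlex
from typing import NamedTuple

class HierarchyEntry(NamedTuple):
    identifier: str            # matched against the metadata value
    position: tuple[int, ...]  # e.g. "2.12.6" becomes (2, 12, 6)
    name: str                  # label of the classification

def parse_hfile(text: str) -> list[HierarchyEntry]:
    """Parse hierarchy lines and return them in hierarchy order."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # shlex keeps a quoted field containing spaces as a single field.
        identifier, position, name = shlex.split(line)
        numbers = tuple(int(part) for part in position.split("."))
        entries.append(HierarchyEntry(identifier, numbers, name))
    return sorted(entries, key=lambda entry: entry.position)

# Hypothetical contents of sub.txt, deliberately out of order.
SAMPLE = '''
Forestry 2 Forestry
Agriculture 1 Agriculture
"Soil science" 1.1 "Soil science"
'''

for entry in parse_hfile(SAMPLE):
    indent = "  " * (len(entry.position) - 1)
    print(f"{indent}{entry.name}")
```

Storing the position marker as a tuple of integers makes 2.12 sort after 2.6, which a plain string comparison would get wrong.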

58
How Classifier Works
  • Classifiers are Perl objects, derived from
    BasClas.pm
  • The new method creates the classifier object.
  • The init method initialises the object with
    parameters such as metadata type, button name and
    sort criterion
  • The classify method is invoked once for each
    document, and stores information about the
    classification made within the classifier object.
  • The get_classify_info method returns the locally
    stored classification information to the build
    process, which it then writes to the collection
    information database for use when the collection
    is displayed at runtime.
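
The lifecycle above (create, initialise, classify each document, return the collected information) can be mimicked in Python. This is an analogue for illustration only: real classifiers are Perl objects derived from BasClas.pm, and the class name, the dictionary-based documents and the shape of the returned data are all assumptions.

```python
from collections import defaultdict

class AlphabeticClassifier:
    """Python analogue of the classifier lifecycle; not Greenstone code."""

    def __init__(self):  # plays the role of the "new" method
        self.metadata = None
        self.buttonname = None
        self.buckets = defaultdict(list)

    def init(self, metadata: str, buttonname: str) -> None:
        """Set the parameters and start with empty classification data."""
        self.metadata = metadata
        self.buttonname = buttonname
        self.buckets.clear()

    def classify(self, doc: dict) -> None:
        """Called once per document; files it under its first letter."""
        value = doc.get(self.metadata)
        if not value:
            return  # documents without the metadata are omitted
        self.buckets[value[0].upper()].append((value, doc["oid"]))

    def get_classify_info(self) -> dict:
        """Return the stored classification for the build process."""
        return {
            "buttonname": self.buttonname,
            "contains": {
                letter: [oid for _, oid in sorted(items)]
                for letter, items in sorted(self.buckets.items())
            },
        }

classifier = AlphabeticClassifier()
classifier.init(metadata="Title", buttonname="Titles A-Z")
for document in [
    {"oid": "HASH01", "Title": "Freshwater Resources in Arid Lands"},
    {"oid": "HASH02", "Title": "Agro-forestry in the Pacific Islands"},
    {"oid": "HASH03"},  # no Title, so it will not appear
]:
    classifier.classify(document)
print(classifier.get_classify_info())
```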

59
Formatting Greenstone Output
  • The web pages are generated on the fly as they
    are needed.
  • The appearance of many aspects of the pages is
    controlled using format strings.
  • Format strings belong in the collection
    configuration file, introduced by the keyword
    format followed by the name of the element to
    which the format applies.
  • Two different kinds of page element are
    controlled by format strings:
  • the items on the page that show documents or
    parts of documents.
  • the lists produced by classifiers and searches.

60
Formatting Greenstone lists
  • Classifiers CL1, CL2, CL3, ... (for AZList,
    DateList, etc.)
  • collect.cfg specifies the list formatting to
    display

61
Format Operations
62
Formatting Example
63
Formatting Example
64
Controlling the Greenstone User Interface
  • The entire Greenstone user interface is
    controlled by macros which reside in the
    GSDLHOME/macros directory.
  • All macro files used by Greenstone are listed in
    GSDLHOME/etc/main.cfg
  • Macro files have a .dm extension.
  • base.dm defines the basic content of a page.
  • Each file defines one or more packages, each
    containing a series of macros

65
Macro Syntax
  • Syntax
  • _content_ {<p><h2>Oops</h2>_textdefaultcontent_}
  • Macros often contain conditional statements.
  • _If_(x,y,z), where x is a condition, y is the
    macro content to use if that condition is true,
    and z the content if it is false.
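
The idea of macros that refer to other macros, and of a conditional that picks one of two contents, can be sketched as follows. This is a toy model in Python, not the Greenstone macro language: the regular expression, the function names and the text stored for textdefaultcontent are all assumptions.

```python
import re

# A macro reference is written _name_ in this toy model.
MACRO_REF = re.compile(r"_([A-Za-z]+)_")

def expand(text: str, macros: dict[str, str], depth: int = 0) -> str:
    """Replace every known _name_ reference by its (expanded) content."""
    if depth > 20:
        raise RecursionError("macros refer to each other in a cycle")

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in macros:
            return match.group(0)  # leave unknown references untouched
        return expand(macros[name], macros, depth + 1)

    return MACRO_REF.sub(substitute, text)

def macro_if(condition: bool, if_true: str, if_false: str = "") -> str:
    """Model of _If_(x,y,z): y when x holds, otherwise z."""
    return if_true if condition else if_false

MACROS = {
    "content": "<p><h2>Oops</h2>_textdefaultcontent_",
    "textdefaultcontent": "No content is available.",  # hypothetical text
}

print(expand("_content_", MACROS))
print(macro_if(False, "shown when true", "shown when false"))
```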

66
About Page Content Example
67
Building Process
  • the text is compressed
  • the full-text indexes that are specified in the
    collection configuration file are created.
  • information about how the collection is to appear
    on the web is precalculated and incorporated into
    the collection
  • for example information about icons and titles,
    and information produced by classifiers.
  • All these steps are handled by mgbuilder (or the
    collection-specific builder), which in turn uses
    the MG (Managing Gigabytes, see Witten et al.,
    1999) software for compressing and indexing.

68
Building Process
  • To make a collection available over the web once
    it is built, you must move it from the
    collection's building directory to the index
    directory.
  • Collections are not built directly into index
    because large collections may take hours or days
    to build.
  • It is important that the building process does
    not affect an existing copy of the collection
    until the build is complete.
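
The move from the building directory to the index directory can be sketched in Python as below. The directory names building and index follow the slide; keeping the previous index as index.old and the function name install_build are assumptions made for this example.

```python
import shutil
import tempfile
from pathlib import Path

def install_build(collection_dir) -> Path:
    """Make a finished build live by moving building/ to index/."""
    collection_dir = Path(collection_dir)
    building = collection_dir / "building"
    index = collection_dir / "index"
    if not building.is_dir() or not any(building.iterdir()):
        raise FileNotFoundError(f"nothing has been built in {building}")
    previous = collection_dir / "index.old"  # hypothetical backup name
    if previous.exists():
        shutil.rmtree(previous)
    if index.exists():
        index.rename(previous)  # the old copy stays intact until this point
    building.rename(index)
    building.mkdir()            # leave an empty directory for the next build
    return index

# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    collection = Path(tmp) / "demo"
    (collection / "building").mkdir(parents=True)
    (collection / "building" / "new.idx").write_text("new")
    (collection / "index").mkdir()
    (collection / "index" / "old.idx").write_text("old")
    install_build(collection)
    print(sorted(p.name for p in (collection / "index").iterdir()))  # ['new.idx']
```

Because the live index is only renamed after the build has finished, readers of the collection never see a half-built index.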

69
(No Transcript)
70
Viewing the Collection
  • A facility for E-mail to be sent to the
    collection's contact E-mail address, and to the
    system's administrator
  • The facility is disabled by default but can be
    enabled by editing the main.cfg configuration
    file

71
Greenstone Archive Format
  • All source documents are brought into the
    Greenstone system by converting them to a format
    known as the Greenstone Archive Format.
  • The <Section> tag denotes the start of each
    document section, and the corresponding
    </Section> closing tag marks the end of that
    section
  • Following each <Section> tag is a <Description>
    section
  • Within this come any number of <Metadata>
    elements.
  • The Dublin Core metadata standard is used for
    defining metadata types

72
Greenstone Archive Format
  <?xml version="1.0" ?>
  <!DOCTYPE GreenstoneArchive SYSTEM
    "http://greenstone.org/dtd/GreenstoneArchive/1.0/GreenstoneArchive.dtd">
  <Section>
    <Description>
      <Metadata name="gsdlsourcefilename">ec158e.txt</Metadata>
      <Metadata name="Title">Freshwater Resources in Arid Lands</Metadata>
      <Metadata name="Identifier">HASH0158f56086efffe592636058</Metadata>
      <Metadata name="gsdlassocfile">cover.jpg:image/jpeg:</Metadata>
      <Metadata name="gsdlassocfile">p07a.png:image/png:</Metadata>
    </Description>
    <Section>
      <Description>
        <Metadata name="Title">Preface</Metadata>
      </Description>
      <Content>
        This is the text of the preface
      </Content>
    </Section>
    <Section>
      <Description>
        <Metadata name="Title">First and only chapter</Metadata>
      </Description>
      <Section>
        <Description>
          <Metadata name="Title">Part 1</Metadata>
        </Description>
        <Content>
          This is the first part of the first and only chapter
        </Content>
      </Section>
      <Section>
        <Description>
          <Metadata name="Title">Part 2</Metadata>
        </Description>
        <Content>
          This is the second part of the first and only chapter

73
Dublin Core Metadata
74
Inside Greenstone Archive Documents
  • Documents are divided into paragraphs.
  • They can be split hierarchically into sections
    and subsections
  • Each document has an associated Object Identifier
    or OID
  • For example, subsection 3 of section 2 of
    document HASHa7 is referred to as HASHa7.2.3.
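
The way section identifiers follow from the nesting of sections can be sketched with Python's xml.etree.ElementTree. The sample below is a shortened archive document in the spirit of the earlier slide, with closing tags supplied by this example; numbering subsections from 1 in document order is an assumption consistent with the HASHa7.2.3 example, and the function name section_oids is invented.

```python
import xml.etree.ElementTree as ET

# Shortened sample; the structure mirrors the archive format slide.
ARCHIVE = """
<Section>
  <Description><Metadata name="Title">Freshwater Resources in Arid Lands</Metadata></Description>
  <Section>
    <Description><Metadata name="Title">Preface</Metadata></Description>
    <Content>This is the text of the preface</Content>
  </Section>
  <Section>
    <Description><Metadata name="Title">First and only chapter</Metadata></Description>
    <Section>
      <Description><Metadata name="Title">Part 1</Metadata></Description>
      <Content>This is the first part of the first and only chapter</Content>
    </Section>
    <Section>
      <Description><Metadata name="Title">Part 2</Metadata></Description>
      <Content>This is the second part of the first and only chapter</Content>
    </Section>
  </Section>
</Section>
"""

def section_oids(section, oid):
    """Yield (oid, title) for a section and, recursively, its subsections."""
    title = section.findtext("Description/Metadata[@name='Title']")
    yield oid, title
    for number, child in enumerate(section.findall("Section"), start=1):
        yield from section_oids(child, f"{oid}.{number}")

root = ET.fromstring(ARCHIVE)
for oid, title in section_oids(root, "HASHa7"):
    print(oid, title)
```

With this sample, "Part 2" is the second subsection of the second section and is therefore listed as HASHa7.2.2.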

75
Hierarchy Structure in Greenstone Documents
76
Administration
  • The administrative facility also presents
    configuration information about the installation
    and allows it to be modified.

77
Collection Info
78
Configuration files
  • Site configuration file: gsdlsite.cfg
  • the name of the directory where the Greenstone
    software is kept
  • the HTTP address of the Greenstone system
  • whether the fastcgi facility is being used.
  • Main configuration file: main.cfg
  • common to the interface of all collections served
    by a Greenstone site.
  • E-mail address of the system maintainer
  • whether the status and collector pages are
    enabled
  • whether logs of user activity are kept
  • whether Internet cookies are used to identify
    users.

79
Logs
  • usage logs
  • error logs
  • initialization logs.
  • Logging is disabled by default
  • It is enabled by including these lines in the
    main system configuration file
  • logcgiargs true
  • turns logging on and off.
  • usecookies true
  • assigned to each user, which enables

80
User Management
  • List users
  • Add a new user
  • Change password
  • Groups
  • Administrator
  • add and remove users, and change their groups
  • Colbuilder
  • access the facilities described above to build
    new collections and alter (and delete) existing
    ones
  • Default user
  • admin

81
Technical Information
  • General
  • gives access to technical information, including
    the directories where things are stored
  • Protocols
  • gives, for each possible protocol type,
    information about each of the collections
    supported by that protocol.
  • Actions
  • Use user interface code (called the
    receptionist) to communicate the wishes of the
    user
  • These actions correspond to the CGI argument
    labeled a
  • For example, if a=status the receptionist invokes
    the status action (which displays the status
    page).
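
Dispatching on the CGI argument a can be sketched with Python's standard urllib.parse. This is only a model of the idea: the real receptionist is part of Greenstone's own code, and the handler function, the ACTIONS table and the returned text are invented for illustration.

```python
from urllib.parse import parse_qs

def status_action(args: dict) -> str:
    """Stand-in for the action that displays the status page."""
    return "status page"

# Maps values of the CGI argument "a" to actions; only "status" comes
# from the slide, and the table itself is hypothetical.
ACTIONS = {"status": status_action}

def receptionist(query_string: str) -> str:
    """Pick the action named by the a argument and run it."""
    args = parse_qs(query_string)
    action_name = args.get("a", [None])[0]
    handler = ACTIONS.get(action_name)
    if handler is None:
        raise KeyError(f"no action registered for a={action_name!r}")
    return handler(args)

print(receptionist("a=status"))  # status page
```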