The INFOMINE project presentation

1
The INFOMINE project

Gordon Paynter
Infomine Lead Programmer
and the Infomine team
Steve Mitchell, Margaret Mooney, Julie Mason et al.
at the University of California, Riverside

2
The Infomine Project

Introduction to Infomine
The core Infomine system
Automation: finding and describing resources
Collaboration: the Fiat Lux portals
Conclusions

3
Introduction to Infomine

Infomine is a virtual library
Infomine's goal is to provide organised access to
the Internet in the same way that we do for
printed works
Library catalogs focus on books and periodicals
Infomine focuses on web sites (mostly, now)
There are many differences between books and web
sites

4
Books Vs. Web sites

Books
Easily-defined, physical objects
Static
Permanent
LC: 119 million items

Web sites
What is a web site anyway?
Continually changing
Frequently disappear
Google: 2 billion pages

5
Books Vs. Web sites

Books
Limited number of publishers
Existing, coordinated cataloging effort
Text not usually electronically available

Web sites
Anyone can publish
Few indexers: Infomine, LII, IPL, BUBL, MEL,
Scout; all are post-hoc
Can be downloaded and processed

6
Simplifying the problem

Editorial standards
Only select the best Web sites
Automated assistance
Collection building
Automated and semi-automated resource description
Catalog maintenance
Wide collaboration
More contributors
Less redundant effort

7
The core Infomine system

Infomine for patrons
Behind the scenes: Infomine for content builders
Open source inputs: what the community gives us
Open source outputs: what we're distributing

12
Infomine core: behind the scenes
17
Infomine core: open source inputs

The Linux operating system
Debian GNU/Linux
Infrastructure
The Apache webserver
MySQL and Berkeley DB databases
Programming tools
The GNU Compiler (gcc) and libraries, emacs
Common libraries

18
Infomine core: open source outputs

The Infomine general-purpose library
http://infomine.ucr.edu/iVia/
The full libInfomine library
Available in August (as documentation is completed)
The full Infomine source
Available Fall 2002

19
Automation: finding and describing resources

Discovering new resources
The Infomine record builder
Extracting useful metadata
Automatically classifying records
Open source inputs
Open source outputs

20
Discovering new resources

The semi-automatic focused web crawler
You suggest a topic or search term
The crawler searches for web pages and clusters
them
You identify useful clusters of documents
(optional)
The crawler reports the top 20 hubs and
authorities
You choose from the list of URLs
The automatic record builder helps generate
metadata
The fully-automatic focused web crawler
Coming soon!
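
The slide does not say how the crawler ranks hubs and authorities. A minimal sketch, assuming Kleinberg's HITS algorithm (the usual source of those two terms) and using hypothetical page names, might look like this:

```python
from math import sqrt


def hits(links, iterations=50):
    """Return (hub, authority) score dicts for a directed link graph.

    links maps each page to the list of pages it links to.
    This is an assumed ranking method, not Infomine's confirmed one.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of the hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub: sum of the authority scores of the pages linked to.
        hub = {p: sum(auth[d] for d in links.get(p, ())) for p in pages}
        # Normalise so the scores do not grow without bound.
        for scores in (auth, hub):
            norm = sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


def top(scores, n=20):
    """Return the n highest-scoring pages, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Hypothetical crawl result: two link pages pointing at three content pages.
example_links = {"hub1": ["a", "b"], "hub2": ["a", "b", "c"]}
hub_scores, authority_scores = hits(example_links)
print("Top hubs:", top(hub_scores, 2))
print("Top authorities:", top(authority_scores, 3))
```

The default of n=20 in top() mirrors the "top 20" on the slide. A person would then choose from the returned URLs, as the slide describes.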

21
The Infomine record builder

Input: a URL or list of URLs
From the focused crawler
From the record builder interface
The record builder creates a new record
Fully-automatic operation
The builder creates new records on its own
Semi-automatic operation
The builder interacts with you at each stage
Output: new records in the pending database
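
The two operating modes can be pictured as one pipeline with an optional review step. The sketch below is illustrative only: build_record, build_pending, guess_title and the record fields are hypothetical names, not part of the Infomine code.

```python
def build_record(url, stages, review=None):
    """Build one record by running each stage over the URL.

    stages is a list of (field_name, extractor) pairs. If review is given,
    it is called as review(field_name, value) after every stage and its
    return value is stored, which models semi-automatic operation.
    """
    record = {"url": url}
    for field, extract in stages:
        value = extract(url)
        if review is not None:
            value = review(field, value)
        record[field] = value
    return record


def build_pending(urls, stages, review=None):
    """Return the new records destined for the pending database."""
    return [build_record(url, stages, review) for url in urls]


# Hypothetical stage: a real builder would fetch and analyse the page.
def guess_title(url):
    return url.rstrip("/").rsplit("/", 1)[-1]


stages = [("title", guess_title)]
pending = build_pending(["http://example.org/insects/"], stages)
```

Calling build_pending without a review function corresponds to fully-automatic operation. Passing one gives a person the chance to confirm or edit each field as it is produced.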

32
New research: LCSH assignment

Dr. Steve Jones, of the University of Waikato
Aim: assign LCSH based on document content
Use training data to build a model
Training data: documents with keyphrases and LCSH
Model: based on keyphrase and LCSH co-occurrence
Use model to assign LCSH to new documents
Extract keyphrases with Kea
Similarity measures identify the best LCSH
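
The slide does not give the similarity measures used. As a rough sketch of the co-occurrence idea, the code below counts how often each keyphrase appears alongside each heading in training data, then scores headings for a new document by summed, normalised counts. The scoring rule and the training examples are assumptions made for illustration, and keyphrase extraction (done by Kea in the project) is taken as already complete.

```python
from collections import Counter, defaultdict


def train(documents):
    """Count keyphrase and heading co-occurrences.

    documents is an iterable of (keyphrases, headings) pairs.
    """
    cooccur = defaultdict(Counter)
    for keyphrases, headings in documents:
        for phrase in keyphrases:
            cooccur[phrase].update(headings)
    return cooccur


def assign(cooccur, keyphrases, n=3):
    """Return up to n headings, ranked by an assumed similarity score."""
    scores = Counter()
    for phrase in keyphrases:
        counts = cooccur.get(phrase)
        if not counts:
            continue  # phrase never seen in training
        total = sum(counts.values())
        for heading, count in counts.items():
            # Each phrase contributes its share of co-occurrences.
            scores[heading] += count / total
    return [heading for heading, _ in scores.most_common(n)]


# Hypothetical training data; the keyphrases are invented for the example.
training = [
    (["bark beetles", "pine"], ["FOREST INSECTS"]),
    (["bark beetles", "scolytidae"], ["FOREST INSECTS"]),
    (["rapeseed", "hybridization"], ["BRASSICA", "PLANT BREEDING"]),
]
model = train(training)
suggested = assign(model, ["bark beetles", "rapeseed"], n=1)
```

Keyphrases that never occurred in training contribute nothing, so a document made up entirely of unseen phrases gets no headings at all.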

33
New research: LCSH assignment

forest insects

forest insects
bark beetles
borers (insects)
tobacco hornworm
scolytidae
greenhouse whitefly
agriculture in literature
mountain pine beetle

34
New research: LCSH assignment

BRASSICA
CROPS
PLANT BREEDING

cruciferae
Buriats
brassica
phytophagous insects
plants, effect of metals on
blood groups in animals
rapeseed
hybridization, vegetable

35
New research: LCSH assignment

CLIMATOLOGY
ENVIRONMENTAL SCIENCES
POLLUTION

atmospheric chemistry
meteorology
continentality (meteorology)
chemical oceanography
multidimensional chromatography
turbulent diffusion (meteorology)
aerosols
precipitation scavenging

36
New research: LCC assignment

Dr. Eibe Frank, of the University of Waikato
Aim: assign LCC based on a set of LCSH
Infomine has LCSH but no LCC
Use with LCSH classifier for new documents
Use training data to build a model
Training data: documents with LCSH and LCC
Model: LCC-hierarchy of Support Vector Machines
Use model to assign LCC to new documents

37
New research: LCC assignment

Performance (preliminary)
Absolute accuracy around 58% (pleasing)
Also 4% are too specific, 3% too general
Top-level accuracy around 80%
What to do if we encounter completely new LCSH?
QA1-43 Science Mathematics General
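
The slide does not define its performance measures. One plausible reading, sketched below, treats each LCC class as a path from the top-level class downwards. A prediction is then exact, too specific (deeper than the true class on the same branch), too general (stopped early on the right branch), or wrong. The path representation and the function name are assumptions.

```python
def evaluate(pairs):
    """Summarise hierarchical predictions as fractions of all pairs.

    pairs is a list of (predicted, actual) class paths, each a tuple
    running from the top-level class down, e.g. ("Q", "QA", "QA1-43").
    """
    counts = {"exact": 0, "too_specific": 0, "too_general": 0, "top_level": 0}
    total = len(pairs)
    for predicted, actual in pairs:
        if predicted == actual:
            counts["exact"] += 1
        elif predicted[:len(actual)] == actual:
            counts["too_specific"] += 1  # right branch, deeper than the truth
        elif actual[:len(predicted)] == predicted:
            counts["too_general"] += 1   # right branch, stopped too early
        if predicted and actual and predicted[0] == actual[0]:
            counts["top_level"] += 1     # top-level class matches
    if total == 0:
        return counts
    return {name: count / total for name, count in counts.items()}


# Hypothetical predictions, one for each kind of outcome.
pairs = [
    (("Q", "QA", "QA1-43"), ("Q", "QA", "QA1-43")),  # exact
    (("Q", "QA"), ("Q", "QA", "QA1-43")),            # too general
    (("Q", "QA", "QA1-43"), ("Q", "QA")),            # too specific
    (("S", "SB"), ("Q", "QK")),                      # wrong branch
]
summary = evaluate(pairs)
```

Under this reading, "absolute accuracy" corresponds to the exact fraction and "top-level accuracy" to the fraction whose first path element matches.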

41
Automation: open source inputs

General and C tools
Linux, Apache, gcc, flex, curl, etc
Java tools
The Java MARC Events (James) toolkit
The Waikato Environment for Knowledge Analysis
(WEKA) machine learning toolkit
The Kea keyphrase extraction program

42
Automation: open source outputs

LCSHtoLCC: LCC assignment
http://infomine.ucr.edu/iVia/
KPtoLCSH: LCSH assignment
Available August
PhraseRate: keyphrase extractor
http://infomine.ucr.edu/iVia/
Artur's Automatic Annotator
Available in Fall 2002 (with Infomine)

43
Collaboration: the Fiat Lux Portals

Fiat Lux
Advantages of collaboration
MyI: research guides and pathfinders
Themes: co-branding for collaborators
Open standards, protocols and source code
Challenges of collaboration

44
Fiat Lux

Established at ALA Midwinter 2002
Prominent, librarian-built, public portals
BUBL, Infomine, IPL, lii.org, MEL, VRL
Goal: resource sharing through collaboration
Fiat Lux represents
170 librarians
100,000 records
30 million searches/year

45
Advantages of collaboration

Greater sustainability and scalability
Reduced redundant effort
Shared cataloging effort
More resources cataloged
Everyone gets a bigger (better) dataset
Shared systems development
Scalability of systems
Preserving institutional identity

46
Themes: co-branding through iVia

Co-branding for institutional cooperators
Many data views can be themed
The data is the same
The appearance is altered
http://infomine.ucr.edu/cgi-bin/canned_search?querytreethemewfu

48
MyI: custom collections

Create research guides / pathfinders
Create a MyI category
Add records to categories in the record editor
Batch add to your category
Create searches for your records
Examples
CSUF-MC, CSUF-MC-NATAM, CSUF-MC-ASIAM...
UCR-DB-MUSIC, UCR-ACCESS-CDL-PASSWORD
UDM-edu459

50
Challenges of collaboration

Investigating lii.org integration
Granularity of metadata
Different editorial processes
Collection focus and audience level
Scholarly Vs. K-12 Vs. public library
How do you merge duplicate records?
LCSH, keywords: easy to combine
Annotation: not sure yet
These are editorial issues rather than technical
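
To make the duplicate-record point concrete, here is a minimal sketch of one possible merge policy. Set-like fields (LCSH, keywords) are unioned, while both annotations are kept side by side, since the slide leaves that question open. The record layout and field names are hypothetical.

```python
def merge_records(a, b):
    """Merge two records that describe the same URL.

    LCSH and keywords are combined as sets. Annotations are not merged:
    both are kept so that an editor can decide between them later.
    """
    if a["url"] != b["url"]:
        raise ValueError("records describe different resources")
    return {
        "url": a["url"],
        "lcsh": sorted(set(a["lcsh"]) | set(b["lcsh"])),
        "keywords": sorted(set(a["keywords"]) | set(b["keywords"])),
        "annotations": [a["annotation"], b["annotation"]],
    }


# Hypothetical records from two collaborating portals.
first = {
    "url": "http://example.org/beetles",
    "lcsh": ["Forest insects"],
    "keywords": ["beetles", "forestry"],
    "annotation": "Scholarly overview of bark beetles.",
}
second = {
    "url": "http://example.org/beetles",
    "lcsh": ["Forest insects", "Bark beetles"],
    "keywords": ["beetles"],
    "annotation": "A guide to beetles for students.",
}
merged = merge_records(first, second)
```

Keeping both annotations defers the editorial decision rather than making it in code, which fits the slide's point that these are editorial issues, not technical ones.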

51
Current collaborators

Fiat Lux
LITA Internet Portals Interest Group
NSF / NSDL
Contributing to the content-building effort

52
Current collaborators

University of California contributors
University of Detroit Mercy
Wake Forest University
California State University contributors
Cornell University
Library of Congress / BEOnline

53
Financial support

FIPSE: Fund for the Improvement of Post-secondary
Education (U.S. Department of Education)
IMLS National Leadership Grant
The Library of the University of California,
Riverside

54
Join us!

Cataloging the Internet (or even just the best
bits) requires collaboration, but we recognise
that most collaborators have different needs.
Open standards and software offer us scalability
and flexibility, and provide us with a base of
work on which to build, and ensure our work will
continue to be free.
http://infomine.ucr.edu/iVia/
