Title: The INFOMINE project
1The INFOMINE project
- Gordon Paynter
- Infomine Lead Programmer
- and the Infomine team
- Steve Mitchell, Margaret Mooney, Julie Mason et al.
- at the University of California, Riverside
2The Infomine Project
- Introduction to Infomine
- The core Infomine system
- Automation: finding and describing resources
- Collaboration: the Fiat Lux portals
- Conclusions
3Introduction to Infomine
- Infomine is a virtual library
- Infomine's goal is to provide organised access to the Internet in the same way that we do for printed works
- Library catalogs focus on books and periodicals
- Infomine focuses on web sites (mostly, now)
- There are many differences between books and web sites
4Books Vs. Web sites
- Books
- Easily-defined, physical objects
- Static
- Permanent
- LC: 119 million items
- Web sites
- What is a web site anyway?
- Continually changing
- Frequently disappear
- Google: 2 billion pages
5Books Vs. Web sites
- Books
- Limited number of publishers
- Existing, coordinated cataloging effort
- Text not usually electronically available
- Web sites
- Anyone can publish
- Few indexers: Infomine, LII, IPL, BUBL, MEL, Scout; all are post-hoc
- Can be downloaded and processed
6Simplifying the problem
- Editorial standards
- Only select the best Web sites
- Automated assistance
- Collection building
- Automated and semi-automated resource description
- Catalog maintenance
- Wide collaboration
- More contributors
- Less redundant effort
7The core Infomine system
- Infomine for patrons
- Behind the scenes: Infomine for content builders
- Open source inputs: what the community gives us
- Open source outputs: what we're distributing
12Infomine core: behind the scenes
17Infomine core: open source inputs
- The Linux operating system
- Debian GNU/Linux
- Infrastructure
- The Apache webserver
- MySQL and Berkeley DB databases
- Programming tools
- The GNU Compiler (gcc) and libraries, emacs
- Common libraries
18Infomine core: open source outputs
- The Infomine general-purpose library
- http://infomine.ucr.edu/iVia/
- The full libInfomine library
- Available in August (once documentation is completed)
- The full Infomine source
- Available Fall 2002
19Automation: finding and describing resources
- Discovering new resources
- The Infomine record builder
- Extracting useful metadata
- Automatically classifying records
- Open source inputs
- Open source outputs
20Discovering new resources
- The semi-automatic focused web crawler
- You suggest a topic or search term
- The crawler searches for web pages and clusters them
- You identify useful clusters of documents (optional)
- The crawler reports the top 20 hubs and authorities
- You choose from the list of URLs
- The automatic record builder helps generate metadata
- The fully-automatic focused web crawler
- Coming soon!
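The "hubs and authorities" step above resembles Kleinberg's HITS algorithm. The slides do not say which algorithm the crawler actually uses, so the following is only a minimal sketch under that assumption; the function names `hits` and `top` and the toy link graph are hypothetical.

```python
from collections import defaultdict


def hits(links, iterations=50):
    """Score pages as hubs and authorities (HITS-style sketch).

    links: dict mapping a page to the pages it links to.
    Returns (hub_scores, authority_scores) as dicts.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # Reverse index: which pages link to a given page.
    inbound = defaultdict(list)
    for src, targets in links.items():
        for dst in targets:
            inbound[dst].append(src)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A good authority is linked to by good hubs.
        auth = {p: sum(hub[q] for q in inbound.get(p, ())) for p in pages}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # A good hub links to good authorities.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth


def top(scores, n=20):
    """Return the n highest-scoring pages, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:n]
```

Calling `top(auth, 20)` and `top(hub, 20)` on the crawl graph would yield the kind of shortlist a librarian could then review by hand.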
21The Infomine record builder
- Input: a URL or list of URLs
- From the focused crawler
- From the record builder interface
- The record builder creates a new record
- Fully-automatic operation
- The builder creates new records on its own
- Semi-automatic operation
- The builder interacts with you at each stage
- Output: new records in the pending database
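The two operating modes above can be sketched as a single function with an optional review hook. This is an illustrative sketch only, not the Infomine record builder: the names `build_record` and `review`, the record fields, and the choice to read metadata from `<title>` and `<meta>` tags are all assumptions.

```python
from html.parser import HTMLParser


class _MetaParser(HTMLParser):
    """Collect the <title> text and <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name"):
            self.meta[attrs["name"].lower()] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def build_record(url, html, review=None):
    """Build a draft record from a page that has already been downloaded.

    review: optional callable taking and returning a record. Passing one
    models semi-automatic operation; omitting it models fully-automatic.
    """
    parser = _MetaParser()
    parser.feed(html)
    keywords = parser.meta.get("keywords", "")
    record = {
        "url": url,
        "title": parser.title.strip(),
        "description": parser.meta.get("description", ""),
        "keywords": [k.strip() for k in keywords.split(",") if k.strip()],
    }
    if review is not None:
        record = review(record)  # a person confirms or edits the draft
    return record
```

In either mode the returned record would then be appended to the pending database for later approval.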
32New research: LCSH assignment
- Dr. Steve Jones, of the University of Waikato
- Aim: assign LCSH based on document content
- Use training data to build a model
- Training data: documents with keyphrases and LCSH
- Model: based on keyphrase and LCSH co-occurrence
- Use model to assign LCSH to new documents
- Extract keyphrases with Kea
- Similarity measures identify the best LCSH
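The co-occurrence idea above can be sketched in a few lines. The slides do not give the similarity measure, so the one used here (the fraction of training documents containing a keyphrase that also carry a heading, summed over the document's keyphrases) is an assumption, as are the names `train` and `assign`. Keyphrases are passed in directly; in the real system they would come from an extractor such as Kea.

```python
from collections import Counter, defaultdict


def train(docs):
    """docs: iterable of (keyphrases, headings) pairs from catalogued records."""
    cooc = defaultdict(Counter)   # keyphrase -> heading -> co-occurrence count
    phrase_count = Counter()      # keyphrase -> number of documents containing it
    for keyphrases, headings in docs:
        for kp in set(keyphrases):
            phrase_count[kp] += 1
            for heading in set(headings):
                cooc[kp][heading] += 1
    return cooc, phrase_count


def assign(model, keyphrases, n=3):
    """Return up to n headings ranked by summed co-occurrence rate."""
    cooc, phrase_count = model
    scores = Counter()
    for kp in set(keyphrases):
        if kp not in cooc:
            continue  # keyphrase never seen in training
        for heading, count in cooc[kp].items():
            scores[heading] += count / phrase_count[kp]
    return [heading for heading, _ in scores.most_common(n)]
```

A document whose keyphrases never appeared in training gets no headings at all, which is one reason a ranked list for human review is safer than automatic assignment.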
33New research: LCSH assignment
- forest insects
- bark beetles
- borers (insects)
- tobacco hornworm
- scolytidae
- greenhouse whitefly
- agriculture in literature
- mountain pine beetle
34New research: LCSH assignment
- BRASSICA
- CROPS
- PLANT BREEDING
- cruciferae
- Buriats
- brassica
- phytophagous insects
- plants, effect of metals on
- blood groups in animals
- rapeseed
- hybridization, vegetable
35New research: LCSH assignment
- CLIMATOLOGY
- ENVIRONMENTAL SCIENCES
- POLLUTION
- atmospheric chemistry
- meteorology
- continentality (meteorology)
- chemical oceanography
- multidimensional chromatography
- turbulent diffusion (meteorology)
- aerosols
- precipitation scavenging
36New research: LCC assignment
- Dr. Eibe Frank, of the University of Waikato
- Aim: assign LCC based on a set of LCSH
- Infomine has LCSH but no LCC
- Use with LCSH classifier for new documents
- Use training data to build a model
- Training data: documents with LCSH and LCC
- Model: LCC-hierarchy of Support Vector Machines
- Use model to assign LCC to new documents
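The hierarchical approach above can be illustrated with a top-down classifier that makes one decision per level of the LCC tree. To stay dependency-free, this sketch swaps the Support Vector Machines for a simple vote-counting stand-in at each node, so it shows the hierarchy only, not the actual model. The names `Node`, `train` and `classify` and the representation of an LCC class as a path such as `("Q", "QA")` are assumptions.

```python
from collections import Counter, defaultdict


class Node:
    """One level of the classification tree."""

    def __init__(self):
        self.children = defaultdict(Node)
        # heading -> Counter of child labels seen with that heading
        self.votes = defaultdict(Counter)


def train(examples):
    """examples: iterable of (headings, path) pairs, path being a tuple of labels."""
    root = Node()
    for headings, path in examples:
        node = root
        for label in path:
            for heading in headings:
                node.votes[heading][label] += 1
            node = node.children[label]
    return root


def classify(root, headings):
    """Descend from the root, picking the most-voted child at each level.

    Stops early, returning a more general class, when no heading at the
    current node has been seen in training.
    """
    path, node = [], root
    while node.children:
        tally = Counter()
        for heading in headings:
            tally.update(node.votes.get(heading, {}))
        if not tally:
            break
        label = tally.most_common(1)[0][0]
        path.append(label)
        node = node.children[label]
    return tuple(path)
```

Stopping early when no evidence is available is one possible way to degrade gracefully: the result is a broader class rather than a wrong narrow one.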
37New research: LCC assignment
- Performance (preliminary)
- Absolute accuracy around 58% (pleasing)
- Also 4% are too specific, 3% too general
- Top-level accuracy around 80%
- What to do if we encounter completely new LCSH?
- QA1-43 Science: Mathematics: General
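The performance categories above (exact, too specific, too general, top-level correct) can be computed as follows. This is a sketch under the assumption that each LCC class is written as a path from the top of the hierarchy, for example `("Q", "QA", "QA1-43")`; the names `judge` and `evaluate` are hypothetical and the sample data in the usage note is invented for illustration.

```python
from collections import Counter


def judge(predicted, actual):
    """Compare two class paths and name the relationship between them."""
    if predicted == actual:
        return "exact"
    if len(predicted) < len(actual) and actual[:len(predicted)] == predicted:
        return "too general"   # predicted an ancestor of the true class
    if len(predicted) > len(actual) and predicted[:len(actual)] == actual:
        return "too specific"  # predicted a descendant of the true class
    return "wrong"


def evaluate(pairs):
    """pairs: iterable of (predicted_path, actual_path). Returns fractions."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no predictions to evaluate")
    outcomes = Counter(judge(p, a) for p, a in pairs)
    total = len(pairs)
    report = {
        name: outcomes[name] / total
        for name in ("exact", "too general", "too specific", "wrong")
    }
    # Top-level accuracy only asks whether the first label matches.
    report["top-level"] = sum(1 for p, a in pairs if p[:1] == a[:1]) / total
    return report
```

For example, a prediction of `("Q", "QA")` for a record classed at `("Q", "QA", "QA1-43")` counts as "too general" but still contributes to top-level accuracy.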
41Automation: open source inputs
- General and C tools
- Linux, Apache, gcc, flex, curl, etc.
- Java tools
- The Java MARC Events (James) toolkit
- The Waikato Environment for Knowledge Analysis (WEKA) machine learning toolkit
- The Kea keyphrase extraction program
42Automation: open source outputs
- LCSHtoLCC: LCC assignment
- http://infomine.ucr.edu/iVia/
- KPtoLCSH: LCSH assignment
- Available August
- PhraseRate: keyphrase extractor
- http://infomine.ucr.edu/iVia/
- Artur's Automatic Annotator
- Available in Fall 2002 (with Infomine)
43Collaboration: the Fiat Lux Portals
- Fiat Lux
- Advantages of collaboration
- MyI: research guides and pathfinders
- Themes: co-branding for collaborators
- Open standards, protocols and source code
- Challenges of collaboration
44Fiat Lux
- Established at ALA Midwinter 2002
- Prominent, librarian-built, public portals
- BUBL, Infomine, IPL, lii.org, MEL, VRL
- Goal: resource sharing through collaboration
- Fiat Lux represents
- 170 librarians
- 100,000 records
- 30 million searches/year
45Advantages of collaboration
- Greater sustainability and scalability
- Reduced redundant effort
- Shared cataloging effort
- More resources cataloged
- Everyone gets a bigger (better) dataset
- Shared systems development
- Scalability of systems
- Preserving institutional identity
46Themes: co-branding through iVia
- Co-branding for institutional cooperators
- Many data views can be themed
- The data is the same
- The appearance is altered
- http://infomine.ucr.edu/cgi-bin/canned_search?querytreethemewfu
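The theming idea above (same data, different appearance) can be sketched as a lookup of presentation templates keyed by theme name. This is an illustrative sketch, not iVia's implementation: the `THEMES` table, the `render` function, the HTML markup, and the mapping of the theme name `wfu` to a Wake Forest University banner are all assumptions.

```python
from html import escape
from string import Template

# Hypothetical themes: each one wraps the same result list differently.
THEMES = {
    "default": Template("<h1>Infomine</h1><ul>$items</ul>"),
    "wfu": Template("<h1>Wake Forest University</h1><ul class='wfu'>$items</ul>"),
}


def render(records, theme="default"):
    """Render the same records under the chosen theme.

    Unknown theme names fall back to the default appearance.
    """
    template = THEMES.get(theme, THEMES["default"])
    items = "".join(
        f"<li><a href='{escape(r['url'])}'>{escape(r['title'])}</a></li>"
        for r in records
    )
    return template.substitute(items=items)
```

Because only the wrapper changes, every collaborator sees identical records, which keeps the shared dataset as the single source of truth.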
48MyI: custom collections
- Create research guides / pathfinders
- Create a MyI category
- Add records to categories in the record editor
- Batch add to your category
- Create searches for your records
- Examples
- CSUF-MC, CSUF-MC-NATAM, CSUF-MC-ASIAM...
- UCR-DB-MUSIC, UCR-ACCESS-CDL-PASSWORD
- UDM-edu459
50Challenges of collaboration
- Investigating lii.org integration
- Granularity of metadata
- Different editorial processes
- Collection focus and audience level
- Scholarly vs. K-12 vs. public library
- How do you merge duplicate records?
- LCSH, keywords: easy to combine
- Annotation: not sure yet
- These are editorial issues rather than technical ones
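The merging question above can be made concrete with a small sketch: list-valued fields such as LCSH and keywords combine by set union, while differing annotations are kept side by side and flagged for an editor. The function name `merge_records`, the field names, and the `needs_review` flag are assumptions for illustration, not part of Infomine.

```python
def merge_records(a, b):
    """Merge two records that describe the same URL.

    LCSH and keywords are unioned. Annotations are not merged
    automatically: distinct ones are all kept and the record is
    flagged so that an editor can decide.
    """
    if a["url"] != b["url"]:
        raise ValueError("records describe different resources")

    merged = {"url": a["url"]}
    for field in ("lcsh", "keywords"):
        merged[field] = sorted(set(a.get(field, ())) | set(b.get(field, ())))

    annotations = [r["annotation"] for r in (a, b) if r.get("annotation")]
    # Drop exact duplicates while keeping the original order.
    merged["annotations"] = list(dict.fromkeys(annotations))
    merged["needs_review"] = len(merged["annotations"]) > 1
    return merged
```

Deciding which annotation survives depends on audience level and editorial policy, which is why the sketch leaves that choice to a person.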
51Current collaborators
- Fiat Lux
- LITA Internet Portals Interest Group
- NSF / NSDL
- Contributing to the content-building effort
52Current collaborators
- University of California contributors
- University of Detroit Mercy
- Wake Forest University
- California State University contributors
- Cornell University
- Library of Congress / BEOnline
53Financial support
- FIPSE: Fund for the Improvement of Post-secondary Education (U.S. Department of Education)
- IMLS National Leadership Grant
- The Library of the University of California, Riverside
54Join us!
- Cataloging the Internet (or even just the best bits) requires collaboration, but we recognise that most collaborators have different needs.
- Open standards and software offer us scalability and flexibility, provide us with a base of work on which to build, and ensure our work will continue to be free.
- http://infomine.ucr.edu/iVia/