Title: Facet-based Knowledge and Records Management
1Facet-based Knowledge and Records Management
- Charlie Arp
- Battelle Memorial Institute
- Manager, Enterprise Content Management
- John Lontos
- UrsaNav EDRM Solutions
- Senior Director
- 5 March 2009
2Agenda
- Problem Statement
- The Importance of Metadata
- History of ECM at Battelle
- What has been done
- Next steps
- Summary
- Issues
- Q&A
3Problem Statement
- Effective information management requires more
metadata than users are willing to create
Description
Relation
Location
Creator
Format
Disposition Schedule
Project
Title
Revision
Date Created
Source
Class
Owner
Category
Date Registered
Author
Identifier
Hold(s)
Type
4Data Entry Example 1
5Data Entry Example 2
6Purpose of Metadata
- Resource Description
- Information Retrieval
- Management of Information Resources
- Documenting Ownership and Authenticity of Digital
Resources
- Enabling Interoperability
Excerpts from the Chartered Institute of
Library and Information Professionals
7Kinds of Metadata
- Administrative metadata: managing the object
- Ownership
- Provenance - chain of custody
- Digital preservation metadata
- Descriptive metadata: finding the object
- Key concepts
- Taxonomic terms
- Terms extracted from the digital object
8Kinds of Metadata
- Free form (uncontrolled) metadata
- Social computing tagging
- Notes
- Searching full text
- Controlled thesaurus metadata
- Terms are pre-defined
- Putting a square peg in a round hole
- Searching known terms/categories
9Metadata Drivers and Standards
- Knowledge sharing in support of research
- DOE-STD-4001-2000
- DOE Directive, O 243.1, RM Program
- DOE Directive, O 243.2, Vital Records
- DOE Directive, O 200.2, Information Collection
Management Program
- ISO 15489, Records Management
- DoD Discovery Metadata Specification
- PREMIS, OCLC Preservation metadata
10What Users Want
- Minimal data entry (i.e., minimal metadata)
- Easy ways to add content
Social Drivers
Business Drivers
11Impact of poor metadata
- Where's that Dilbert?
- I desperately wanted to paste an old Dilbert
strip about a "boss stalker" in reply to Noella's
post. But I just couldn't seem to find it! How do
you search for old cartoon strips? There's no
metadata associated with these image files. I
searched Google (web, images) for "dilbert boss
stalker" without luck. [1]
- Electronic Basements
[1] http://mannu.livejournal.com/87255.html
12Information retrieval - Full text search vs.
Metadata enhanced search
- Full text
- Returns an overwhelming amount of unfiltered
information
- The user has to employ search strategies
- Boolean and/or proximity to cull through the
information
- Metadata
- The information has already been filtered into
meaningful groups
- The user is searching on known attributes
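The contrast on this slide can be sketched in a few lines of Python. This is only an illustration: the records, field names, and category values below are invented for the example, and the functions are not part of TRIM, SharePoint, or Clear Forest.

```python
# Hypothetical records: "text" stands in for the full text of a document,
# "category" for a metadata value assigned to it.
RECORDS = [
    {"id": 1, "text": "smoke obscurant field trial report",
     "category": "Smoke Obscurants"},
    {"id": 2, "text": "toxicology of smoke inhalation",
     "category": "Toxicology"},
    {"id": 3, "text": "effect of smoke screens on combat units",
     "category": "Combat Effectiveness"},
]


def full_text_search(records, term):
    """Return every record whose text contains the term (no filtering)."""
    term = term.lower()
    return [r for r in records if term in r["text"].lower()]


def metadata_search(records, **attributes):
    """Return records whose metadata equals every requested attribute value."""
    return [r for r in records
            if all(r.get(k) == v for k, v in attributes.items())]
```

Searching the text for "smoke" returns all three records, while `metadata_search(RECORDS, category="Toxicology")` returns only the one record already filed under that known attribute.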
13Origin of Battelle ECM Office
- Battelle ECM was established in 2006 as Records
Management moved from Legal into the newly formed
Knowledge Management unit
- KM was tasked to create a knowledge-rich
environment to enable Battelle to develop,
acquire, manage, and leverage knowledge assets.
14Origin of Battelle ECM Office
- RMO/ECM always viewed as the foundation of KM
- To efficiently create, capture, find, manage and
share knowledge (records) within the flow of
normal activities
- Make using the ECM as easy as possible
15Battelle's ECM Goal
- Facilitate management of electronic records
through intuitive searching and the automatic
generation of useful metadata
- Simplify the search experience for users
- Knowledge workers spend 3.5 hours per week
searching for but not finding the information
they need (IDC, 2005)
- If the users cannot find what they put into the
RMA they will never use it
- We needed to make it easier for the users to
submit records to TRIM
- 50% of users said they would not use the RMA
because it was too difficult and time consuming
- NHPRC RMA project in MI, 2000
16What has been done
- Implementation of RMA (HP TRIM)
- 1 seat to 105 seats in 5 years
- Digital preservation (e.g., fixity check utilities)
- Collecting and maintaining permanent digital
objects
- Portals
- SharePoint project and business sites accessing
the RMA
- Text analytics
- Categorization tools, Taxonomies
- Faceted search tools
17Current ECM activities
- Integrated software packages using the
strengths of different applications
- SharePoint for ease of use
- Clear Forest for automated creation of metadata
- TRIM for records keeping functionality
- Reliable and authentic records
18Department SharePoint Sites: Current uses
- Upload reports and search for and retrieve these
reports
- Authentic official version of the report
remains in TRIM
- Two sites being used
- 10 users
- 2,000 reports (?)
19Project SharePoint Sites: Proposed usage
- Selected document libraries are connected to
TRIM
- Users submitting, searching for and retrieving
files in TRIM through SharePoint
- Working files and Project reports
- Not the drafts library
- No lists: no events, links, tasks, announcements,
or contacts
- Nothing from the discussion board or survey
portions of the site
20Sample Project Site
21Sample Site
24Clear Forest
- It extracts or creates metadata from structured
or unstructured content
- Can be used on digital objects containing text:
documents, e-mail, databases, web sites, Excel
- Uses semantic/linguistic and statistical
analysis
- Server based application
25Clear Forest
- Entity extraction: Identifies and tags metadata
based on grammatical rules
- Lexicons: Identifies and tags pre-defined word
lists
- Key Concepts: Identifies, tags and prioritizes
noun phrases found in documents
- Categorization: Assigns subject headings
(taxonomic terms) based on terms and key words
defined by training sets
26Entity Extraction
- Results
- Negative
- No control: you get whatever is in the
document
- It will give you odd results from time to time
- Positive
- Easy and quick, we can run a large set of
documents almost immediately
- Can give you surprisingly good results
27Entity Extraction
28Entity Extraction
- Technology
- Products
- Company
- Location
- Country
- City
- Region or state
29Lexicons
- Results
- Negative
- Creating the word lists can be time-consuming
- Quicker than a key word search but
difference?
- Positive
- Very dependable, can use phrases
- Can be very specific (good and bad)
- Easy and quick
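A minimal sketch of lexicon tagging, assuming a lexicon is simply a named list of pre-defined words and phrases. The facet names, terms, and the `tag_with_lexicon` helper are made up for illustration and do not reflect how Clear Forest stores or applies its lexicons.

```python
import re

# Hypothetical lexicon: facet name -> pre-defined terms (phrases are allowed).
LEXICON = {
    "Technology": ["fixity check", "faceted search"],
    "Products": ["SharePoint", "TRIM"],
}


def tag_with_lexicon(text, lexicon):
    """Return {facet: [terms found]} using whole-word, case-insensitive matching."""
    tags = {}
    for facet, terms in lexicon.items():
        found = [
            term for term in terms
            if re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE)
        ]
        if found:
            tags[facet] = found
    return tags
```

Whole-word matching is what makes a lexicon "very specific": the word "trimmed" does not trigger the TRIM entry, but neither would a spelling variant that is missing from the list.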
30Key Concepts
- Results
- Negative
- Some of the noun phrases are inaccurate
- Will always be some throw away phrases
- Need at least 5 noun phrases
- Positive
- Easy to use
- Produces a good look into the document
31Categorization
- Results
- Negative
- Create a taxonomy
- Categories (taxonomy) must be distinct
- Creating training sets is time consuming and
difficult
- Positive
- Great results when it is done well
- Best metadata to enable enhanced search
32Categorization
- Assign a set of documents (known as a training
set) to a category (subject heading)
- Uncategorized documents are assigned to a
category based on frequency of category
keywords identified within the document
- Number of categories is unlimited
- Defining a taxonomy for the organization
- Can be difficult
- Is time consuming
33Categorization
- For each category Clear Forest defines a set of
positive documents and a set of negative documents
- Positive docs: those in the category of interest
- Negative docs: docs in other categories
- Words are scored based on frequency of
appearance in positive category
- Words are scored based on frequency of
appearance in negative category
Figure: Smoke Obscurants supplies the positive docs;
Combat Effectiveness, Toxicology, Counter-proliferation,
Counter-terrorism, and 17 other categories supply the
negative docs
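The positive/negative scoring idea can be sketched as follows. The slides do not give Clear Forest's actual formula, so the score used here (a word's relative frequency in the positive set minus its relative frequency in the negative set) is an assumption, and the training texts are invented.

```python
from collections import Counter


def word_freqs(docs):
    """Relative frequency of each word across a list of documents."""
    counts = Counter(w for d in docs for w in d.lower().split())
    total = sum(counts.values()) or 1
    return {w: c / total for w, c in counts.items()}


def train(training_sets):
    """Score each word per category: positive frequency minus negative frequency."""
    model = {}
    for category, positive_docs in training_sets.items():
        # Negative docs are the training docs of every other category.
        negative_docs = [d for other, docs in training_sets.items()
                         if other != category for d in docs]
        pos, neg = word_freqs(positive_docs), word_freqs(negative_docs)
        model[category] = {w: pos.get(w, 0.0) - neg.get(w, 0.0)
                           for w in set(pos) | set(neg)}
    return model


def categorize(text, model):
    """Assign the category whose word scores sum highest for this text."""
    words = text.lower().split()
    totals = {cat: sum(scores.get(w, 0.0) for w in words)
              for cat, scores in model.items()}
    return max(totals, key=totals.get)


# Hypothetical training sets, two short documents per category.
TRAINING = {
    "Smoke Obscurants": ["smoke screen obscurant test",
                         "obscurant smoke grenade"],
    "Toxicology": ["toxicology dose exposure study",
                   "inhalation exposure toxicology"],
}
```

With these two tiny training sets, `categorize("new obscurant smoke trial", train(TRAINING))` picks Smoke Obscurants. The sketch also shows why categories must be distinct: words shared between categories cancel out in the score.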
36How it Works
1) Document is uploaded to Project SharePoint Site
2) Document is transferred to the TRIM repository
(HP TRIM)
3) A copy of the document is passed to ClearForest
4) The TRIM record is updated with ClearForest
output
5) New attributes are exposed via SharePoint
faceted search
Usable Metadata (three field sets shown in the diagram)
- Title, Author, Date Created
- Title, Author, Creator, Date Created, Date
Registered, Project, Contract, Award Date,
Project Mgr
- The same nine fields plus Categories, Key
Concepts, Technology, Products, Company, City,
Country, Region or State
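The five steps can be mimicked with an in-memory stand-in. Nothing below calls SharePoint, TRIM, or ClearForest: the repository is a plain dictionary, `extract_metadata` is a toy substitute for the text analytics step, and all names are hypothetical.

```python
from collections import Counter


def extract_metadata(text):
    """Stand-in for step 3: tag countries from a tiny, made-up word list."""
    known_countries = ["Guinea", "Canada"]
    return {"Country": [c for c in known_countries if c in text]}


def ingest(repository, doc_id, title, text):
    """Steps 1-4: register the upload, then enrich the record with extracted tags."""
    record = {"Title": title}               # steps 1-2: user-supplied metadata only
    record.update(extract_metadata(text))   # steps 3-4: add generated metadata
    repository[doc_id] = record
    return record


def facets(repository, field):
    """Step 5: count the values of one list-valued field for display as facets."""
    return Counter(value
                   for record in repository.values()
                   for value in record.get(field, []))
```

After ingesting two documents that mention Canada and one that mentions Guinea, `facets(repo, "Country")` returns the counts a faceted search page would show next to each country, without the user having typed any of that metadata.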
37SharePoint Portal
40Records Management Application
- Compliant with DOE-STD-4001-2000
- Interfaces to Microsoft Applications
- Connectors to non-Microsoft data sources
41Next Steps
- TRIM/SharePoint integration
- Broader deployment
- Analysis of ongoing pilot projects - governance
document
- Harvesting V2 sites: ingest into TRIM
- Faceted search
- Fully automate Clear Forest updates
- Refinement of entity extraction
- Refinement of search facets
- Digital Preservation
- Automated Fixity Checks on digital objects
- Migrate from PDF to PDF/A
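An automated fixity check amounts to recomputing a file's checksum and comparing it with the value recorded at ingest. The sketch below assumes SHA-256 and a digest stored alongside the record; the slides do not say which algorithm or utility Battelle uses, and the function names are illustrative.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_ok(path, recorded_digest):
    """Return True if the file still matches the digest recorded at ingest."""
    return sha256_of(path) == recorded_digest
```

Run on a schedule over every permanent digital object, a `False` result flags a file that has changed or been corrupted since it was registered.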
42Issues
- TRIM/SharePoint integration
- Support (TRIM and SharePoint expertise)
- 32 bit vs. 64 bit
- Faceted search
- Clear Forest will create erroneous metadata
- Guinea the country vs. Guinea the pig
- Programming support for Clear Forest
- Dial4j
- Just getting started: production issues
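The Guinea example shows how a bare word match produces erroneous metadata. One hedged way to picture a refinement is a rule that ignores the word when it is followed by "pig". This regular expression is an illustration only, not a Clear Forest rule, and it would still need further rules for other cases.

```python
import re

# Match "Guinea" as a whole word unless it is followed by "pig" or "pigs".
COUNTRY_GUINEA = re.compile(r"\bGuinea\b(?!\s+pigs?\b)", re.IGNORECASE)


def mentions_guinea_country(text):
    """Return True if the text mentions Guinea other than in 'guinea pig(s)'."""
    return COUNTRY_GUINEA.search(text) is not None
```

"Tested on guinea pigs." is rejected, while "Guinea pig studies in Guinea" is still tagged because of the second, standalone mention.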
43Summary
- Microsoft SharePoint
- Enterprise Portal
- Project and Team Sites
- Collaboration and Document Authoring
- Faceted Searching
- HP TRIM
- Unified Records Management Platform
- Vital Records Features
- Physical and Electronic Records Management
- Clear Forest
- Auto Categorization
- Entity Extraction
- Key Concept Tagging
- Benefits
- Easier for users to add content
- Easier for users to find information
- Improved service to customers
- Enhanced business intelligence
- Enhanced regulatory compliance
- Improved e-Discovery response
- Features
- Automatic application of rich metadata
- Streamlined user experience
- Seamless and fully automated integration
44Contact Information
Charlie Arp
Manager, Enterprise Content Management
Battelle Memorial Institute
(614) 424-7897
arpc_at_battelle.org
John Lontos
Senior Director
UrsaNav EDRM Solutions
(703) 625-9821
jlontos_at_ursanav.com