Title: The Future of Metadata
1The Future of Metadata
- Denise Bedford
- World Bank
- Presentation to Fall Metadata Forum
- November 2, 2005
- Department of Homeland Security
2Meta-Future
- Most of our information use and access today is based on an anonymous access model
- It is increasingly clear that anonymous access to information and the packaging of information for single use contexts is neither sufficient for users nor an efficient use of development/engineering resources
- We need to think in terms of contextualization and sensitization of information so that it can be used in any context where it pertains
- In the future, information will flow: information, not the systems in which it lives or was created, will be our focus
- Information needs to be agile and mobile: it needs to be sensitized to the contexts in which it might be used, to the interests of those who might use it, and to the applications that might consume it
3Meta-Future
- Envision a future like that described in the Netcentric Information Models formulated by the Dept. of Defense
- Information is created, tagged, posted and shared
- Any applications or users can, according to security privileges, use any information they can find, in any application they need to use to do their work
- Technology becomes increasingly invisible but more logic based
- More and different kinds of information, such as reference sources, need to be managed and maintained
- This meta-future is heavily dependent upon the existence of rich, conceptual, sensitized, meaningful metadata
- This future is now: it is simply a practical view of the Semantic Web
4The problem with metadata
- This future sounds wonderful and the contextualization vision is exciting, but there's just one problem: metadata
- Metadata:
- Is expensive and time consuming to create
- Is sometimes subjective and not granular enough
- Doesn't always address the ways that users and systems think about the information it describes
- May not tell us enough about the information to trust it
- May address only one context: the context for which it is created
- May live only in the source application where it was created
- May not be as accessible as the information asset
- If a Meta-Future depends on metadata, we have to solve these problems
5The problem with technologies
- Many of the tools are so tightly integrated that you might generate rich metadata, but it will not make your information agile or mobile
- Statistical clustering engines do not get us to persistent meaning or contextualization. Clustering engines are great for thresholding or pattern tracings, but they will not generate the kind of metadata we need to realize this future
- We need semantic engines at the base of all our metadata efforts, and these engines need to be available in multiple languages, because semantics vary by language
- Magic black box approaches are neither meaningful nor sustainable: you need to have access to the programs through a user-friendly interface so you can adapt them to your environment without having to have programming knowledge
- You need to have several different kinds of technologies to do what I'm going to describe today, not just one tool
6Content Dimension
Content Metadata
Region Scheme
Ideas Tacit Knowledge
Country Scheme
Content Elements Structure (XML)
Collection Development Policy
Bank's Business Language
Topic Thesaurus
Content Quality Management
Business Activity Scheme
Programmatic Metadata Capture
Metadata Management
Topic Scheme
Concept Extraction
Anonymous Access (Context Free)
Information Diffusion (Context Sensitive Group)
Information Gathering Transformation (Context Sensitive Person)
Business Process Awareness
Searching
Browsing
Translation Systems
Concept Filtering
Sense Making
Collaborative Filtering
Parametric Searching
Searching By Tools
Publishing
Results Clustering
Content Aggregation
Context Dimension
Individual Discovery
User-User Profile Matching
Knowledge Sharing
QA Systems
Results Sorting
Searching By Source
Text Classification
Syndication Engines
Directories of Expertise
Task Filtering
Workflow Management
Community Building
Individual Learning
Social Group SDI
Authentication Rules
Social Filtering
Threshold Filtering
Centralized Collections
Task Oriented SDI
Personal SDI
Advisory Services
Online Training
Authorization Rules
Recommender Engines
Communities SDI
Content Repurposing
User Dimension
Social Groups
Institutional Roles
Social Group Profiles
Client Profiles
Partner Profiles
Organizational Entities
Institutional Profiles
Individual Profiles
Communities Of Practice
Individual Profiles
Understanding the Dimensions of Contextualization
7Vision of Contextualization
- We need to address metadata challenges not in a traditional way but in the future context, with the idea that metadata is contextualizable and sensitized to support information agility and mobility
- In order to achieve contextualization you need to have extreme metadata
- Metadata about the information
- Metadata about the user
- Metadata about the context
- Rich metadata designed to meet many functional requirements
- Metadata in multiple languages
- Metadata needs to be interpretable for and in a context
- Reference sources not only for traditional metadata but for all of the relationships and logic that are present in an ontology (simply different kinds of taxonomy representations)
- Metadata must reflect any context or interest that a user might express
- Still need to have some control over metadata in order to make it understandable in different contexts
8New View of Ontology
Orgs Referenced
uses
Metadata
Contextual Matrix Sensing
Contextual Logic
Rule Logic
People Referenced
Business Rule
Context
Topic Class Scheme
Has Meaning in
Content Entity1
User
Business Process Scheme
Has values
Has relationship to
Thesaurus
Has
Has
Metadata
uses
Content Parts
Country Names
Profile
Has
Region Names
Content Elements
Has
Metadata
Skill Sets/ Competencies
Contains
Has values
Content
Standard Statistical Variables
Hierarchy
Flat Taxonomy
Network Taxonomy
Faceted Taxonomy
Ring Taxonomy
9Getting to Rich Metadata
- Given the future demand for rich, contextualizable metadata, and all of the traditional drawbacks, how will we achieve this future?
- We need to look for a different model for creating and sustaining metadata and reference sources
- We need to teach technologies how to capture the metadata we need and how to maintain our reference sources
- I'd like to show you an example of how we might achieve that future
- Please keep in mind that I'm showing you an example of what is possible: Enterprise Search, Authority Control/Entity Discovery
10Fueling Semantic Search With Metadata
- Or, if Metadata is Dead, Semantic Web and Semantic Search Are Dead
11Ring taxonomy
Ring taxonomy
Flat taxonomy
Hierarchical taxonomy
Fielded Search Faceted Taxonomy
12Ring Taxonomy
Metadata
Network Taxonomy
13More explicit View of faceted taxonomy
14Building and Maintaining Taxonomies
- Moving towards automated metadata generation means that catalogers shift their effort to reviewing the metadata generated and to more fully developing and maintaining subject headings/thesauri and classification schemes as part of a suite of categorization tools
- Level of effort shifts to training and developing the tools and away from original cataloging and metadata capture
- Continue to work closely with subject experts to define the controlled vocabularies and classification schemes
- It means that you have to have a metadata infrastructure that looks something like that ontology we just reviewed
- There is no silver bullet ontology tool out there that will do this work for you: your knowledge and skills are critical
15Metadata Capture Methods
Identification/ Distinction
Compliant Document Management
Search Browse
Use Management
Extrapolate from Business Rules
Programmatic Capture
Human Capture
Inherit from System Context
16Smart Use of Technologies
- Sample structure: Bank Topics Classification Scheme (hierarchical taxonomy)
- Oracle data classes used to represent Topic Classification scheme
- Hierarchical taxonomy as reference source for the attribute Topic
- Used for Browse, Search, Content Syndication, Personalization
- 1st challenge is to architect the hierarchy correctly
- 3 distinct data classes, not a tree structure with inheritance
- Allows you to use the three data classes for distinct functions across systems but still enforce relationships across the classes
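The "three distinct data classes" idea can be sketched in a few lines. This is a minimal illustration in Python, not the Oracle implementation shown in the talk: the class names, field names and sample labels are all assumptions made for the example. Each level is its own class, linked to its parent by a key rather than by inheritance, and the relationship is checked when a record is added.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Topic:
    topic_id: str
    label: str


@dataclass(frozen=True)
class Subtopic:
    subtopic_id: str
    label: str
    topic_id: str  # key pointing at the parent Topic (no inheritance)


@dataclass(frozen=True)
class Subsubtopic:
    subsubtopic_id: str
    label: str
    subtopic_id: str  # key pointing at the parent Subtopic


class TopicScheme:
    """Holds the three classes separately and enforces the links between them."""

    def __init__(self):
        self.topics = {}
        self.subtopics = {}
        self.subsubtopics = {}

    def add_topic(self, topic):
        self.topics[topic.topic_id] = topic

    def add_subtopic(self, subtopic):
        # Reject a subtopic whose parent topic is not in the scheme.
        if subtopic.topic_id not in self.topics:
            raise ValueError(f"unknown topic: {subtopic.topic_id}")
        self.subtopics[subtopic.subtopic_id] = subtopic

    def add_subsubtopic(self, subsubtopic):
        # Reject a sub-subtopic whose parent subtopic is not in the scheme.
        if subsubtopic.subtopic_id not in self.subtopics:
            raise ValueError(f"unknown subtopic: {subsubtopic.subtopic_id}")
        self.subsubtopics[subsubtopic.subsubtopic_id] = subsubtopic

    def path(self, subsubtopic_id):
        """Return (topic, subtopic, sub-subtopic) labels for one leaf."""
        leaf = self.subsubtopics[subsubtopic_id]
        parent = self.subtopics[leaf.subtopic_id]
        root = self.topics[parent.topic_id]
        return (root.label, parent.label, leaf.label)


# Hypothetical sample values, for illustration only.
scheme = TopicScheme()
scheme.add_topic(Topic("T1", "Education"))
scheme.add_subtopic(Subtopic("S1", "Primary Education", "T1"))
scheme.add_subsubtopic(Subsubtopic("SS1", "Teacher Training", "S1"))
print(scheme.path("SS1"))
```

Because each level lives in its own table-like structure, one system can use only the topic list (for browsing, say) while another uses the full path, and the cross-class checks still hold.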
17Three Oracle Data classes
Relationships across data classes
18Topic data class
19Subtopic Data Class
20Subsubtopic Data class
21Categorizing and Indexing Content
- Let's look at how we're categorizing our content to this structure automatically
- Topic classification, geographical region assignment, keywording examples
- Can apply this approach to any kind of content
- Enables us to build a robust metadata repository model, with strong metadata quality, to move towards SI at the functional level
- Also note that we can do this across many languages
22Semantic Analysis Using The Technologies to Best
Advantage
- Semantic analysis tools which support concept extraction, categorization, summarization and pattern matching rules engines
- Teragram works in 23 languages
- Use categorization to capture Topics, Business Activities, Regions, Sectors, Themes, etc.
- Use Concept Extraction to capture keywords
- Use Rules Engine to capture Loan #, Credit #, Project ID, Trust Fund #, etc.
- Use Summarization to generate a gist of the content
23How does semantic analysis work?
24Semantic Analysis Basics
- Once you have made some sense of the sentence (decompose), reconstruct entities for information extraction (compose)
- Identify names and other fixed form expressions: people, organizations, actions, relationships, places
- Identify basic noun groups, verb groups, formatting elements, logic statements
- Construct complex noun groups and verb groups
- Identify event structures
- Identify common elements and associate
25Leveraging the Topic Structure
- Each subtopic is a knowledge domain (hierarchical taxonomy)
- Each subtopic has an extensive concept level definition (1,000 to 5,000 concepts)
- Concepts are controlled vocabularies in their raw form (flat taxonomy)
- Concepts with relationships (extensive per new Z39.19 standard) comprise a semantic network (network taxonomy)
- Categorization tools work with the topic structure and concept definitions to categorize and index content
- The following screen illustrates how that same structure is embedded into a Teragram profile to support categorization
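As a rough illustration of how concept definitions can drive categorization, here is a minimal Python sketch. It is not Teragram's engine or rule syntax: the subtopic names, the concept lists and the `min_hits` threshold are hypothetical, and a real definition would hold thousands of concepts plus grammatical operators rather than a handful of phrases.

```python
import re
from collections import Counter

# Hypothetical concept definitions: each subtopic is described by a small
# controlled vocabulary (a flat taxonomy of terms).
SUBTOPIC_CONCEPTS = {
    "Primary Education": ["primary school", "enrollment", "teacher training", "literacy"],
    "Water Supply": ["water supply", "sanitation", "drinking water", "irrigation"],
}


def categorize(text, concepts=SUBTOPIC_CONCEPTS, min_hits=2):
    """Score each subtopic by how many concept matches occur in the text.

    Returns (subtopic, hit_count) pairs, best first, keeping only subtopics
    that reach the min_hits threshold.
    """
    # Normalize to lowercase words separated by single spaces.
    normalized = " ".join(re.findall(r"[a-z0-9]+", text.lower()))
    scores = Counter()
    for subtopic, terms in concepts.items():
        for term in terms:
            pattern = r"\b" + re.escape(term) + r"\b"
            scores[subtopic] += len(re.findall(pattern, normalized))
    return [(name, hits) for name, hits in scores.most_common() if hits >= min_hits]


sample = "The project funds teacher training and raises primary school enrollment."
print(categorize(sample))
```

The threshold is the design choice to watch: a single matching term is weak evidence, so requiring several concept hits trades a little recall for better precision.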
26Subtopics
Domain concepts or controlled vocabulary
27Extensive operators allow us to write grammatical
rules to manage typical semantic problems
28Concept based rules engine allows us to define
patterns to capture other kinds of data
29Example of use of Authority Control to capture
country names but extract authorized version of
country name
Example of use of a gazetteer concept
extraction rules engine to support semantic
interoperability
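Authority control of this kind can be sketched as a lookup from every known variant to one authorized form. The sketch below is an assumption-laden illustration in Python: the authority file, its authorized forms and its variants are hypothetical and are not taken from the Bank's reference sources.

```python
import re

# Hypothetical authority file: authorized form -> known variants.
COUNTRY_AUTHORITY = {
    "Russian Federation": ["Russia"],
    "Egypt, Arab Republic of": ["Egypt", "Arab Republic of Egypt"],
    "Lao People's Democratic Republic": ["Laos", "Lao PDR"],
}


def build_variant_index(authority):
    """Map every variant (and the authorized form itself) to the authorized form."""
    index = {}
    for authorized, variants in authority.items():
        for variant in variants + [authorized]:
            index[variant.lower()] = authorized
    return index


def extract_countries(text, authority=COUNTRY_AUTHORITY):
    """Find country mentions in text and return their authorized names, in order."""
    index = build_variant_index(authority)
    # Try longer variants first so "Arab Republic of Egypt" wins over "Egypt".
    alternatives = sorted(index, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(v) for v in alternatives) + r")\b",
        re.IGNORECASE,
    )
    found = []
    for match in pattern.finditer(text):
        authorized = index[match.group(1).lower()]
        if authorized not in found:
            found.append(authorized)
    return found


print(extract_countries("Flood recovery in Laos and the Arab Republic of Egypt; Egypt co-finances."))
```

Storing only the authorized form in the metadata is what makes records from different systems comparable, whatever spelling the source document used.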
30Use of concept extraction rules engine to
capture Loan #, Credit #, Project ID
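Pattern rules for fixed-form identifiers such as loan, credit and project numbers can be approximated with regular expressions. The sketch below is in Python, not the rules engine used in the talk, and every identifier format in it is an assumption chosen for illustration; the real formats would come from the organization's own business rules.

```python
import re

# Hypothetical identifier formats, assumed for this example only.
RULES = {
    "project_id": re.compile(r"\bP\d{6}\b"),
    "loan_number": re.compile(r"\bLoan\s+(?:No\.|#)\s*(\d{4,5}(?:-[A-Z]{2,3})?)"),
    "credit_number": re.compile(r"\bCredit\s+(?:No\.|#)\s*(\d{4,5}(?:-[A-Z]{2,3})?)"),
    "trust_fund_number": re.compile(r"\bTF\s?(\d{5,6})\b"),
}


def capture_identifiers(text, rules=RULES):
    """Apply each pattern rule and collect the distinct values it captures."""
    captured = {}
    for field, pattern in rules.items():
        values = []
        for match in pattern.finditer(text):
            # Use the capture group when the rule defines one, else the whole match.
            value = match.group(1) if pattern.groups else match.group(0)
            if value not in values:
                values.append(value)
        if values:
            captured[field] = values
    return captured


text = "Project P075219 is financed by Loan No. 4567-IN and Credit # 3210, with grant TF051234."
print(capture_identifiers(text))
```

Keeping the rules in a plain table, separate from the matching loop, means a new identifier type can be added without touching the code that applies the rules.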
33Overview of Process Tools
34Enterprise Profile Creation and Maintenance
- Enterprise Metadata Profile
- Concept Extraction Technology
- Country
- Organization Name
- People Name
- Series Name/Collection Title
- Author/Creator
- Title
- Publisher
- Standard Statistical Variable
- Version/Edition
- Categorization Technology
- Topic Categorization
- Business Function Categorization
- Region Categorization
- Sector Categorization
- Theme Categorization
UCM Service Requests
Update Change Requests
Data Governance Process for Topics, Business
Function, Country, Region, Keywords, People,
Organizations, Project ID
e-CDS Reference Sources for Country, Region,
Topics, Business Function, Keywords, Project ID,
People, Organization
Enterprise Profile Development Maintenance
JOLIS E-Journals
Factiva
ISP
TK240 Client
IRIS
ImageBank
Teragram Team
35Content Owners
Content Owners
Dedicated Server Teragram Semantic Engine
Concept Extraction, Categorization, Clustering,
Rule Based Engine, Language Detection
APIs Integration
APIs Integration
ISP Integration
IRIS Functional Team
IRIS Integration
Business Analyst
Enterprise Metadata Capture Strategy TK240
Client XML Output
Content Capture
Content Capture
XML Wrapped Metadata
XML Wrapped Metadata
APIs Integration
APIs Technical Integration
Enterprise Profile Development Maintenance
Factiva Metadata Database
ImageBank Integration
e-CDS Reference Sources
IDU Indexers
SITRC Librarians
Enterprise Metadata Capture Functional
Reference Model
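The "XML Wrapped Metadata" step in this reference model can be pictured as taking the captured values and serializing them for the consuming system. The following Python sketch shows one way that might look; the element names, the `record` wrapper and the sample values are assumptions, since the talk does not give the actual output schema.

```python
import xml.etree.ElementTree as ET


def wrap_metadata(doc_id, metadata):
    """Serialize captured metadata as a small XML record.

    metadata maps a field name to a list of values. Field names are used as
    element names, so they must be valid XML names.
    """
    record = ET.Element("record", attrib={"id": doc_id})
    for field, values in metadata.items():
        for value in values:
            element = ET.SubElement(record, field)
            element.text = value
    return ET.tostring(record, encoding="unicode")


# Hypothetical captured values, for illustration only.
xml_text = wrap_metadata(
    "doc-001",
    {
        "topic": ["Education"],
        "country": ["Egypt, Arab Republic of"],
        "keyword": ["teacher training", "literacy"],
    },
)
print(xml_text)
```

Wrapping the metadata in a self-describing format is what lets it travel with the content from the capture engine into other applications.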
36Impacts & Outcomes
- Information Access impacts
- Increased precision of search
- Better control over recall
- Searching like we talk
- Exact match searching and known item searching will work better
- Metadata based searching now begins to resemble full-text searching but with all the advantages of structure and context, and a significant reduction in the amount of noise
- Productivity Improvements
- Can now assign deep metadata to all kinds of content
- Remove the human review aspect from the metadata capture
- Reduce unit times where human review is still used
- Information Quality impacts
- All metadata carries the information architecture with it
- Apply quality metrics at the metadata level to eliminate the need to build fuzzy search architectures: these rarely scale or improve in performance
- Use the technologies to identify and fix problems with our data
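For readers less familiar with the two search measures named above: precision is the share of retrieved items that are relevant, and recall is the share of relevant items that were retrieved. A small Python sketch, with made-up document identifiers, shows the arithmetic.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant items retrieved / all items retrieved
    recall    = relevant items retrieved / all relevant items
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Illustrative identifiers only: 4 documents returned, 3 truly relevant, 2 in common.
precision, recall = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5"])
print(precision, recall)
```

In this example half of what came back is relevant (precision 0.5) and two of the three relevant documents were found (recall of two thirds). Richer, controlled metadata aims to raise the first number without giving up the second.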
37In Progress Impacts
- Same methodology can be leveraged to develop a structure of lines of business, entities prominent in particular domains, relationships among entities in a domain, standard statistical variables, etc.
- The richer the metadata and the more fully elaborated the reference structures, the closer we come to understanding at a system level what is happening in a particular domain at any point in time
- It is this overall structure which can then be leveraged in other contexts, perhaps even a counter-terrorism context, to threshold events
- Without metadata, though, no information asset can be secured while its importance is still known
- Without metadata, no information is agile or mobile
38Thank You.