Title: Leveraging the Unstructured Data
1Leveraging the Unstructured Data
Presented at ICE 2002, Edmonton, AB, Canada, October 22, 2002
2Topics
- Summary
- What is Unstructured Data?
- The Value in Unstructured Data
- Technologies
- Text Mining
- Audio Mining
- Image Mining
- Unstructured Data Management Issues
- Examples / Demos
3Summary
- Unstructured data consists of text, audio, images, etc.
- Technologies and tools exist for leveraging the value in unstructured data
- Unstructured data contains significant business value
- The value in unstructured data is mostly untapped
- A paradigm shift is needed
4What is Unstructured Data?
- Any data without a well-defined model for information access
- Examples:
- Word documents
- E-mails
- Examples of what is structured
- Database tables
- Objects
- XML tags
5Unstructured Data Management (UDM)
- The process of mining and analyzing unstructured data to capture actionable information
- Market size
- $100M for text mining
- Much greater IT impact
6The Value in Unstructured Data
- Amount of text-based data alone will grow to over 800 terabytes by 2004 (Forrester)
- Amount of unstructured data in large corporations doubles every 2 months (IDC)
- Companies with a UDM system in place are, on average, at least 15% more productive (Basex)
7The Value in Unstructured Data
- The average knowledge worker spends 2.5 hours per day searching for documents (IDC, March 2002)
- 80-90% of information on the net and corporate networks is unstructured (Goldman Sachs)
If only we could know what we know
8UDM Increases Informational Content
Structured Data (10-40%)
Unstructured Data (60-90%)
9UDM Complements Structured Data
[Diagram labels: Structured Data, Consolidated Data, Unstructured Data, UDM, Contextual Information, Greater Value]
10The Value in Unstructured Data
- Business Value
- Better information
- More timely information
- More relevant information
- Better decision support
- IT Impact
- More information to store and manage
- More complex analysis
- Great business impact
Source: META Group, 9/20/2001
11UDM Technologies
12Text Mining
- The process of extracting information from textual data and utilizing it for better business decisions
- Based on multiple technologies, e.g.,
- Computational linguistics
- Statistics
- A new business intelligence tool
- Focus on semantics and not keywords
- An emerging technology
13Computational Linguistics
- Definition
- Study of computer algorithms for
- Natural language understanding
- Natural language generation
- Objectives
- Machine translation
- Information retrieval
- Human-Machine interface
- Early work began in 1950s
14Syntax Analysis
- Structure determination: generation of a parse tree using a grammar
- Regularizing the syntactic structure
- Restricting a large number of possible structures to a small number
[Parse tree: Sentence -> Subject ("Mary") + Verb Phrase; Verb Phrase -> Verb ("eats") + Object ("cheese")]
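The parse tree on this slide can be reproduced with a very small sketch. The grammar below (Sentence -> Subject + Verb Phrase, Verb Phrase -> Verb + Object) follows the slide; the three-word lexicon and the function names `parse_sentence` and `show` are assumptions made for illustration, and a real parser would handle far more structures.

```python
# Toy lexicon: an assumption for illustration, not part of the original slides.
LEXICON = {"Mary": "Noun", "cheese": "Noun", "eats": "Verb"}

def parse_sentence(words):
    """Parse a three-word sentence with the grammar
    Sentence -> Subject VerbPhrase, VerbPhrase -> Verb Object.
    Returns a nested-tuple parse tree, or None if the words do not fit."""
    if len(words) != 3:
        return None
    tags = [LEXICON.get(word) for word in words]
    if tags != ["Noun", "Verb", "Noun"]:
        return None
    subject, verb, obj = words
    return ("Sentence",
            ("Subject", subject),
            ("Verb Phrase", ("Verb", verb), ("Object", obj)))

def show(tree, depth=0):
    """Return the tree as a list of indented lines, one node per line."""
    label, *children = tree
    lines = ["  " * depth + label]
    for child in children:
        if isinstance(child, tuple):
            lines.extend(show(child, depth + 1))
        else:
            lines.append("  " * (depth + 1) + child)
    return lines

tree = parse_sentence("Mary eats cheese".split())
print("\n".join(show(tree)))
```

A word order the grammar does not cover, such as "eats Mary cheese", returns `None`, which is the "restricting the number of possible structures" idea in its simplest form.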
15Semantic Analysis
- Example of ambiguities at the syntactic level
- "I saw a man in the park with a telescope"
- Semantic analysis is required
- Synonyms
- Deep parsing
- Prior knowledge
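To make the ambiguity concrete, the sketch below counts how many distinct parse trees a toy grammar assigns to the example sentence, using the standard CKY chart-parsing technique. The grammar rules and lexicon are assumptions chosen for illustration; the slides do not specify a grammar.

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form (hypothetical, for illustration only).
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # prepositional phrase modifies the action
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),   # prepositional phrase modifies a noun phrase
    ("PP", "P", "NP"),
]
LEXICON = {
    "I": "NP", "saw": "V", "a": "Det", "the": "Det",
    "man": "N", "park": "N", "telescope": "N",
    "in": "P", "with": "P",
}

def count_parses(words, start="S"):
    """Count the parse trees of `words` rooted at `start` with a CKY chart."""
    n = len(words)
    # chart[(i, j)][X] = number of parse trees of words[i:j] rooted at X
    chart = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        chart[(i, i + 1)][LEXICON[word]] += 1
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for parent, left, right in BINARY_RULES:
                    chart[(i, j)][parent] += chart[(i, k)][left] * chart[(k, j)][right]
    return chart[(0, n)][start]

sentence = "I saw a man in the park with a telescope".split()
print(count_parses(sentence))  # 5 under this toy grammar
```

Syntax alone cannot choose among these readings (who has the telescope, who is in the park), which is why semantic analysis and prior knowledge are needed.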
16Categories of Text-Mining
- Feature Extraction
- Entities (e.g., names, companies, places)
- Events (e.g., mergers, elections, sales)
- Relations among entities and events
- Document Categorization
- Grouping multiple articles based on their contextual similarities
- Summarization
- A condensed version of one or more documents
- Thematic Analysis
- Discovery of the theme/context within a document
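As a rough illustration of document categorization, the sketch below groups short texts by cosine similarity of their word counts. Word overlap is only a crude stand-in for the contextual similarity the slide describes; the sample documents, the 0.3 threshold, and the greedy grouping strategy are all assumptions for the example.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Count lower-cased, whitespace-separated words."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def group_documents(docs, threshold=0.3):
    """Greedy single pass: each document joins the first group whose first
    member is similar enough, otherwise it starts a new group."""
    groups = []
    for index, text in enumerate(docs):
        vector = bag_of_words(text)
        for group in groups:
            if cosine_similarity(vector, group["seed"]) >= threshold:
                group["members"].append(index)
                break
        else:
            groups.append({"seed": vector, "members": [index]})
    return [group["members"] for group in groups]

docs = [
    "bank profits rise as banks report earnings",
    "election results announced after national vote",
    "banks report record profits this earnings season",
    "voters await election results",
]
print(group_documents(docs))  # [[0, 2], [1, 3]]
```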
17Example of Feature Extraction
Document:
Profits at Canada's six big banks topped C$6 billion ($4.4 billion) in 1996, smashing last year's C$5.2 billion ($3.8 billion) record as Canadian Imperial Bank of Commerce and National Bank of Canada wrapped up the earnings season Thursday. The six banks each reported a double-digit jump in net income for a combined profit of C$6.26 billion ($4.6 billion) in fiscal 1996 ended Oct. 31.
Extracted Information:
- Event: Profits topped C$6 billion
- Country: Canada
- Entity: Big banks
- Organization: Canadian Imperial Bank of Commerce
- Organization: National Bank of Canada
- Date: Earnings season
- Date: Fiscal 1996
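A minimal sketch of feature extraction on the first sentence of this example, using a lookup list for names and regular expressions for amounts and years. The lists `ORGANIZATIONS` and `COUNTRIES`, the patterns, and the restored "C$" currency prefix are assumptions for illustration; commercial text-mining tools rely on much richer linguistic analysis.

```python
import re

TEXT = ("Profits at Canada's six big banks topped C$6 billion ($4.4 billion) "
        "in 1996, smashing last year's C$5.2 billion ($3.8 billion) record as "
        "Canadian Imperial Bank of Commerce and National Bank of Canada wrapped "
        "up the earnings season Thursday.")

# Hypothetical lookup lists (gazetteers) for the example.
ORGANIZATIONS = ["Canadian Imperial Bank of Commerce", "National Bank of Canada"]
COUNTRIES = ["Canada"]

def extract_features(text):
    """Return (type, value) pairs found in the text."""
    features = []
    masked = text
    for name in ORGANIZATIONS:
        if name in text:
            features.append(("Organization", name))
            # Mask the name so "Canada" inside it is not counted as a country.
            masked = masked.replace(name, " ")
    for name in COUNTRIES:
        if re.search(r"\b" + re.escape(name) + r"\b", masked):
            features.append(("Country", name))
    for amount in re.findall(r"C\$\d+(?:\.\d+)? billion", text):
        features.append(("Amount", amount))
    for year in re.findall(r"\b(?:19|20)\d{2}\b", text):
        features.append(("Date", year))
    return features

features = extract_features(TEXT)
for kind, value in features:
    print(f"{kind}: {value}")
```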
18Sources of Text
- Textual documents
- Corporate intranets
- News
- Chat rooms
- Web pages
- E-mails
- Faxes
- etc.
19Sample Applications
- News analysis for evidence gathering
- Patent analysis
- E-mail routing
- Competitive intelligence
- Warranty claims analysis
- CRM
- Content management
- Market research
- Recruiting
- eLearning
- Automated help-desks
- Chat room monitoring
- Web page monitoring
- Document clustering
- Legacy document conversion
- Machine translation
- Knowledge management
- Intelligent search engines
- e-Procurement
20Audio Mining
- Analysis of audio data
- Speech
- Music
- Other sounds
- Goal: Extract information from audio
- Who is the speaker
- What is said
- Defect detection
- Music identification
- Sonar object recognition
- Telecommunications monitoring
21Audio Mining
- Analysis is based on audio attributes, e.g.,
- Volume
- Pitch
- Timbre
- Sources of audio for analysis
- Voice recordings
- Factory sounds
- Telecommunications
- Broadcasts
- etc.
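Two of the audio attributes above, volume and pitch, can be estimated from raw samples with a few lines of arithmetic. The sketch below uses a synthetic sine tone; the sample rate, the RMS definition of volume, and the zero-crossing pitch estimate are assumptions chosen for simplicity, and timbre is not covered.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed for this example)

def sine_wave(frequency_hz, amplitude, seconds=1.0):
    """Generate a synthetic tone as a list of samples."""
    count = int(SAMPLE_RATE * seconds)
    return [amplitude * math.sin(2 * math.pi * frequency_hz * n / SAMPLE_RATE)
            for n in range(count)]

def rms_volume(samples):
    """Root-mean-square level, a common measure of loudness."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def estimate_pitch_hz(samples):
    """Estimate pitch by counting upward zero crossings per second.
    Works for clean single tones only; real speech needs sturdier methods."""
    crossings = sum(1 for prev, cur in zip(samples, samples[1:]) if prev < 0 <= cur)
    return crossings * SAMPLE_RATE / len(samples)

tone = sine_wave(440, amplitude=0.5)
print(f"volume (RMS): {rms_volume(tone):.3f}")
print(f"pitch estimate: {estimate_pitch_hz(tone):.0f} Hz")
```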
22Sample Applications
- Broadcast content management
- Call center automation
- CRM
- Manufacturing quality control
- Music retrieval (query by humming)
- Security
- etc.
23Image Mining
- Analysis of digital images
- Pictures
- Drawings
- Videos
- Goal: Extract information from images
- Face recognition
- Defect detection
- Object recognition
- Action/event detection
24Image Mining
- Analysis is based on spatial attributes, e.g.,
- Color
- Size
- Texture (macro and micro)
- Shapes/outlines/shadows
- Sources of images
- Digital photographs
- Surveillance cameras
- Satellite images
- Broadcasts
- etc.
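The color and size attributes can be illustrated on a tiny synthetic image stored as rows of RGB tuples. The image, the red rectangle standing in for a detected object, and the helper names are assumptions for the example; production systems work on real pixel data with dedicated imaging libraries.

```python
def make_image(width, height, background, rect, rect_color):
    """Build an image as rows of RGB tuples with one filled rectangle.
    rect = (left, top, right, bottom); right and bottom are exclusive."""
    left, top, right, bottom = rect
    return [[rect_color if left <= x < right and top <= y < bottom else background
             for x in range(width)]
            for y in range(height)]

def mean_color(image):
    """Average R, G and B over all pixels."""
    pixels = [pixel for row in image for pixel in row]
    return tuple(sum(p[channel] for p in pixels) / len(pixels) for channel in range(3))

def object_size(image, is_object):
    """Width and height of the bounding box around pixels matching is_object,
    or None if no pixel matches."""
    coords = [(x, y) for y, row in enumerate(image)
              for x, pixel in enumerate(row) if is_object(pixel)]
    if not coords:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (max(xs) - min(xs) + 1, max(ys) - min(ys) + 1)

image = make_image(8, 6, background=(0, 0, 0), rect=(2, 1, 5, 3), rect_color=(200, 0, 0))
print(mean_color(image))                          # (25.0, 0.0, 0.0)
print(object_size(image, lambda p: p[0] > 100))   # (3, 2)
```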
25Sample Applications
- Manufacturing quality control
- Broadcast content management
- Remote Sensing
- Security and authentication
- Forensics
- Video logs
- Geophysics
- Aerial Photogrammetry
- etc.
26Unstructured Data Management Issues
- Metrics
- Commercial Tools
- Related Technologies
- Challenges
27Metrics
- Accuracy
- Percentage of extracted information that is correct
- Thoroughness
- Percentage of the facts present in the source that were extracted
- Focus
- Percentage of extracted information that is relevant and useful
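The three metrics can be computed from sets of facts. In information retrieval terms, accuracy as defined here corresponds to precision and thoroughness to recall. The function name `udm_metrics` and the sample fact labels are assumptions for illustration.

```python
def udm_metrics(extracted, present, relevant):
    """Return accuracy, thoroughness and focus as percentages.
    extracted: facts the tool produced
    present:   facts actually in the source documents
    relevant:  facts the business user considers relevant and useful"""
    extracted, present, relevant = set(extracted), set(present), set(relevant)
    correct = extracted & present
    return {
        "accuracy": 100 * len(correct) / len(extracted) if extracted else 0.0,
        "thoroughness": 100 * len(correct) / len(present) if present else 0.0,
        "focus": 100 * len(extracted & relevant) / len(extracted) if extracted else 0.0,
    }

metrics = udm_metrics(
    extracted={"A", "B", "C", "X"},     # "X" is a wrong extraction
    present={"A", "B", "C", "D", "E"},  # "D" and "E" were missed
    relevant={"A", "B"},
)
print(metrics)  # {'accuracy': 75.0, 'thoroughness': 60.0, 'focus': 50.0}
```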
28Sample UDM Tools
- APR/Smartlogik
- Autonomy
- Clairvoyance
- ClearForest
- Entrieva
- Insightful
- MAMI
- Megaputer
Partial list of products. The author does not
recommend any products.
29Related Technologies
- Business Intelligence
- Knowledge Management
- Content Management
- eLearning
- Collaboration
- Innovation Management
- Sales Force Automation
- Data Mining
- Visualization
30Sample Architecture
[Diagram labels: Structured Data, Rules, Data Warehouse, Information Extraction, Unstructured Data, Analysis and Decision Support, Visualization]
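The arrangement of the boxes cannot be recovered from the slide text, so the sketch below assumes one plausible flow: rules drive information extraction from unstructured data, the extracted rows join structured data in a warehouse, and analysis and visualization run on the combined store. The rules, sample contracts, and function names are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical extraction rules: attribute name -> regular expression.
RULES = {
    "discount": re.compile(r"(\d+)% discount"),
    "refund": re.compile(r"refund of \$(\d+)"),
}

def extract_information(doc_id, text):
    """Information Extraction: apply each rule to free text, return structured rows."""
    rows = []
    for attribute, pattern in RULES.items():
        for match in pattern.finditer(text):
            rows.append({"source": doc_id, "attribute": attribute,
                         "value": int(match.group(1))})
    return rows

def load_warehouse(structured_rows, documents):
    """Data Warehouse: structured rows plus rows extracted from documents."""
    warehouse = list(structured_rows)
    for doc_id, text in documents.items():
        warehouse.extend(extract_information(doc_id, text))
    return warehouse

def analyze(warehouse):
    """Analysis: count how often each attribute occurs."""
    return Counter(row["attribute"] for row in warehouse)

def visualize(summary):
    """Visualization: one text bar per attribute."""
    return [f"{attribute:<10} {'#' * count}" for attribute, count in sorted(summary.items())]

structured_rows = [{"source": "erp", "attribute": "spend", "value": 1200}]
documents = {
    "contract-1": "Supplier grants a 5% discount on orders and a refund of $300 for returns.",
    "contract-2": "A 10% discount applies after the first year.",
}
warehouse = load_warehouse(structured_rows, documents)
summary = analyze(warehouse)
print("\n".join(visualize(summary)))
```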
31Challenges
- Paradigm Shift
- Which applications? What to analyze?
- Where's the ROI? What are the risks?
- Business Readiness
- Technology Maturity
- Ambiguity Resolution
- Uniqueness
- Timeliness
- Context
- Testing
- Efficacy
- Adverse Reaction
32Bibliography
- Computational Auditory Scene Analysis, by David F. Rosenthal (Editor), Hiroshi G. Okuno (Editor)
- Computational Linguistics: An Introduction, by Ralph Grishman, Cambridge University Press
- Elements of Photogrammetry with Applications in GIS, by Paul R. Wolf, Bon A. Dewitt
- Emerging Solutions for Managing Unstructured CRM Data, by Richard Peynot (Giga Information Group)
- Foundations of Statistical Natural Language Processing, by Christopher D. Manning, Hinrich Schütze
- Proceedings of ACM SIGKDD Knowledge Discovery and Data Mining Conference, 1999
33Examples / Demos
34Examples / Demos
- EDS - Bank of Knowledge
- BBC - Neon
- ClearForest - Patent Analysis
- Insightful - Aerial Photogrammetry
- EDS - Securities Fraud Detection
- Not included in handouts and files
35Bank of Knowledge
36EDS Bank of Knowledge Project
- Supply chain intelligence
- EDS has 35,000 active supplier contracts
- Not possible to read, understand, and utilize all the terms in those contracts
- Revenue and cost reduction opportunities are lost because some contractual terms are not known and not enforced, e.g.,
- Discounts offer cost reductions
- Refunds offer revenue generation
37Features
- Modules
- Spend Management
- Compliance Management
- Supplier Intelligence
- Contracts Management
- Seamless integration of technologies
- Text-mining
- Data mining
- Business intelligence
- Advanced visualization
38Sample Contract Attributes
- Pricing
- Discounts
- Margins
- Pro-rata Refunds
- Levels of Support
- License Clauses
- Contract Amendments
- Confidentiality
- Warranty Information
- Freight Information
BOK correlates the above attributes with other procurement functions
39Metrics
- $4,000 average cost to manually create, execute, manage and track a single contract
- Average $2,000 savings per contract reviewed/enforced
- 3-5% cost savings based on addressable spend patterns
- 12-15% improved productivity
- ROI in about 6 months of usage
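A back-of-the-envelope sketch of what the per-contract figure could mean at the scale of the 35,000 active supplier contracts mentioned earlier. It assumes the amounts are in dollars; the review rates are hypothetical, and the slides do not say how many contracts were actually reviewed or what the system cost.

```python
def potential_savings(active_contracts, percent_reviewed, savings_per_contract):
    """Total savings if the given percentage of contracts is reviewed and enforced."""
    contracts_reviewed = active_contracts * percent_reviewed // 100
    return contracts_reviewed * savings_per_contract

ACTIVE_CONTRACTS = 35_000      # from the Bank of Knowledge project slide
SAVINGS_PER_CONTRACT = 2_000   # average savings per contract reviewed/enforced (dollars assumed)

for percent in (10, 50, 100):  # hypothetical review rates
    total = potential_savings(ACTIVE_CONTRACTS, percent, SAVINGS_PER_CONTRACT)
    print(f"{percent:>3}% reviewed: {total:,}")
```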
40BBC Neon
41BBC Neon
- NEws information Online
- A BBC application developed by APR/Smartlogik
- Robust online news archive solution
- Concept-based news search engine
- 5000 users
- Provides a varied selection of news publications
- Retention of the BBC's existing taxonomy combined
with the flexibility to update this structure
Example courtesy of APR and BBC
42Patent Analysis
43Patent Analysis
- Licensing
- Development
- Asset Evaluation
- Recruiting opportunities
- Prosecution
- Litigation
44Example: Patent Analysis with ClearForest
45
46
47
- Fuel cell inventors working together
48
- Link to any patent document, highlighting key words