Leveraging the Unstructured Data - PowerPoint PPT Presentation


1
Leveraging the Unstructured Data
Presented at ICE 2002, Edmonton, AB, Canada,
October 22, 2002
  • Kas Kasravi
  • EDS Fellow

2
Topics
  • Summary
  • What is Unstructured Data?
  • The Value in Unstructured Data
  • Technologies
  • Text Mining
  • Audio Mining
  • Image Mining
  • Unstructured Data Management Issues
  • Examples / Demos

3
Summary
  • Unstructured data consists of text, audio,
    images, etc.
  • Technologies and tools exist for leveraging the
    value in unstructured data
  • Unstructured data contains significant business
    value
  • The value in unstructured data is mostly untapped
  • A paradigm shift is needed

4
What is Unstructured Data?
  • Any data without a well-defined model for
    information access
  • Examples:
      • Word documents
      • E-mails
  • Examples of what is structured:
      • Database tables
      • Objects
      • XML tags

5
Unstructured Data Management (UDM)
  • The process of mining and analyzing unstructured
    data to capture actionable information
  • Market size
      • $100M for text mining
      • Much greater IT impact

6
The Value in Unstructured Data
  • Amount of text-based data alone will grow to
    over 800 terabytes by 2004 - Forrester
  • Amount of unstructured data in large
    corporations doubles every 2 months - IDC
  • Companies with a UDM system in place are, on
    average, at least 15% more productive - Basex

7
The Value in Unstructured Data
  • The average knowledge worker spends 2.5 hours
    per day searching for documents - IDC, March
    2002
  • 80-90% of information on the net and corporate
    networks is unstructured - Goldman Sachs

If only we could know what we know
8
UDM Increases Informational Content

Structured Data (10-40%)
Unstructured Data (60-90%)
9
UDM Complements Structured Data
[Diagram labels: Structured Data, Consolidated Data, Unstructured Data, UDM, Contextual Information, Greater Value]
10
The Value in Unstructured Data
  • Business Value
  • Better information
  • More timely information
  • More relevant information
  • Better decision support
  • IT Impact
  • More information to store and manage
  • More complex analysis
  • Great business impact

Source: META Group, 9/20/2001
11
UDM Technologies
12
Text Mining
  • The process of extracting information from
    textual data, and utilizing it for better
    business decisions
  • Based on multiple technologies, e.g.,
      • Computational linguistics
      • Statistics
  • A new business intelligence tool
  • Focus on semantics and not keywords
  • An emerging technology

13
Computational Linguistics
  • Definition
      • Study of computer algorithms for
          • Natural language understanding
          • Natural language generation
  • Objectives
      • Machine translation
      • Information retrieval
      • Human-machine interface
  • Early work began in the 1950s

14
Syntax Analysis
  • Structure determination: generation of a parse
    tree using a grammar
  • Regularizing the syntactic structure
  • Restricting the large number of possible
    structures to a small number

[Parse tree: Sentence → Subject (Mary) + Verb Phrase; Verb Phrase → Verb (eats) + Object (cheese)]
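The parse tree on this slide can be reproduced with a very small sketch. The grammar and lexicon below are assumptions made for illustration only (they cover just the slide's sentence pattern), and the function names `parse` and `show` are hypothetical; a real parser would use a much larger grammar.

```python
# Toy grammar (an assumption for illustration, not from the presentation):
#   Sentence    -> Subject "Verb Phrase"
#   Verb Phrase -> Verb Object
# The lexicon maps each known word to the role it may fill.
LEXICON = {"Mary": "Subject", "eats": "Verb", "cheese": "Object"}


def parse(sentence):
    """Return a nested-tuple parse tree, or raise ValueError if the toy grammar rejects the input."""
    words = sentence.split()
    roles = [LEXICON.get(word) for word in words]
    if roles != ["Subject", "Verb", "Object"]:
        raise ValueError(f"sentence not covered by the toy grammar: {sentence!r}")
    subject, verb, obj = words
    return ("Sentence",
            ("Subject", subject),
            ("Verb Phrase", ("Verb", verb), ("Object", obj)))


def show(tree, depth=0):
    """Render a parse tree as a list of indented lines."""
    label, *children = tree
    indent = "  " * depth
    if len(children) == 1 and isinstance(children[0], str):
        return [f"{indent}{label}: {children[0]}"]  # leaf: role and word
    lines = [indent + label]
    for child in children:
        lines.extend(show(child, depth + 1))
    return lines


if __name__ == "__main__":
    print("\n".join(show(parse("Mary eats cheese"))))
```

Rejecting every word order other than Subject, Verb, Object is the simplest form of "restricting the large number of possible structures to a small number".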
15
Semantic Analysis
  • Example of ambiguities at the syntactic level
  • I saw a man in the park with a telescope
  • Semantic analysis is required
  • Synonyms
  • Deep parsing
  • Prior knowledge

16
Categories of Text-Mining
  • Feature Extraction
      • Entities (e.g., names, companies, places)
      • Events (e.g., mergers, elections, sales)
      • Relations among entities and events
  • Document Categorization
      • Grouping multiple articles based on their
        contextual similarities
  • Summarization
      • A condensed version of one or more documents
  • Thematic Analysis
      • Discovery of the theme/context within a
        document
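Document categorization can be sketched as grouping by similarity. The example below is a minimal sketch that assumes plain word overlap (cosine similarity over word counts) as a stand-in for "contextual similarity"; the sample documents, the threshold value, and the function names are all hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words vector: word -> count."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def group(docs, threshold=0.3):
    """Greedy grouping: each document joins the first group whose
    seed (first member) is similar enough, else starts a new group."""
    groups = []
    for name, text in docs.items():
        vec = vectorize(text)
        for g in groups:
            if cosine(vec, g["seed"]) >= threshold:
                g["members"].append(name)
                break
        else:
            groups.append({"seed": vec, "members": [name]})
    return [g["members"] for g in groups]


# Hypothetical sample articles.
DOCS = {
    "banks-1": "bank profits rose as bank earnings beat forecasts",
    "banks-2": "bank earnings and profits hit a record",
    "patents-1": "fuel cell patent filed by inventors",
    "patents-2": "inventors filed a new fuel cell patent",
}

if __name__ == "__main__":
    print(group(DOCS))
```

Word overlap ignores synonyms and word order, which is the gap that the semantic techniques described earlier are meant to close.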

17
Example of Feature Extraction
Document

Profits at Canada's six big banks topped C$6 billion
($4.4 billion) in 1996, smashing last year's C$5.2
billion ($3.8 billion) record as Canadian Imperial Bank
of Commerce and National Bank of Canada wrapped up the
earnings season Thursday. The six banks each reported a
double-digit jump in net income for a combined profit of
C$6.26 billion ($4.6 billion) in fiscal 1996 ended
Oct. 31.

Extracted Information

  • Event: Profits topped C$6 billion
  • Country: Canada
  • Entity: Big banks
  • Organization: Canadian Imperial Bank of Commerce
  • Organization: National Bank of Canada
  • Date: Earnings season
  • Date: Fiscal 1996
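A rough sketch of how a few of the features above could be pulled out of the passage follows. It assumes a small hand-built name list (a gazetteer) and one regular expression; this is not how any of the commercial tools named in this presentation work, and the passage is abridged, with the C$ currency symbol restored as an assumption.

```python
import re

# Abridged from the sample passage; the "C$" symbol is an assumption.
TEXT = ("Profits at Canada's six big banks topped C$6 billion in 1996 as "
        "Canadian Imperial Bank of Commerce and National Bank of Canada "
        "wrapped up the earnings season Thursday. The six banks each "
        "reported a double-digit jump in net income in fiscal 1996 ended "
        "Oct. 31.")

# Hypothetical gazetteers: known names to look for.
ORGANIZATIONS = ["Canadian Imperial Bank of Commerce", "National Bank of Canada"]
COUNTRIES = ["Canada"]
FISCAL_YEAR = re.compile(r"fiscal\s+\d{4}", re.IGNORECASE)


def extract(text):
    """Return (feature type, value) pairs found in the text."""
    features = [("Organization", org) for org in ORGANIZATIONS if org in text]

    # Blank out organization names first, so that "Canada" inside
    # "National Bank of Canada" is not reported as a country mention.
    masked = text
    for org in ORGANIZATIONS:
        masked = masked.replace(org, "")
    for country in COUNTRIES:
        if re.search(rf"\b{re.escape(country)}\b", masked):
            features.append(("Country", country))

    features.extend(("Date", m.group()) for m in FISCAL_YEAR.finditer(text))
    return features


if __name__ == "__main__":
    for kind, value in extract(TEXT):
        print(f"{kind}: {value}")
```

Events such as "profits topped" and relations among entities need the syntactic and semantic analysis described earlier; a name list and a pattern cannot capture them.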

18
Sources of Text
  • Textual documents
  • Corporate intranets
  • News
  • Chat rooms
  • Web pages
  • E-mails
  • Faxes
  • etc.

19
Sample Applications
  • News analysis for evidence gathering
  • Patent analysis
  • E-mail routing
  • Competitive intelligence
  • Warranty claims analysis
  • CRM
  • Content management
  • Market research
  • Recruiting
  • eLearning
  • Automated help-desks
  • Chat room monitoring
  • Web page monitoring
  • Document clustering
  • Legacy document conversion
  • Machine translation
  • Knowledge management
  • Intelligent search engines
  • e-Procurement

20
Audio Mining
  • Analysis of audio data
      • Speech
      • Music
      • Other sounds
  • Goal: Extract information from audio
      • Who is the speaker
      • What is said
      • Defect detection
      • Music identification
      • Sonar object recognition
      • Telecommunications monitoring

21
Audio Mining
  • Analysis is based on audio attributes, e.g.,
      • Volume
      • Pitch
      • Timbre
  • Sources of audio for analysis
      • Voice recordings
      • Factory sounds
      • Telecommunications
      • Broadcasts
      • etc.
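Two of the audio attributes listed above, volume and pitch, can be estimated with a few lines of code. This is a minimal sketch on a synthetic tone: it assumes RMS amplitude as the volume measure and zero-crossing counting as the pitch estimate, which only works for clean single-tone signals. Timbre is not covered, and the function names are hypothetical.

```python
import math


def sine_wave(freq_hz, duration_s=1.0, sample_rate=8000, amplitude=0.5):
    """Synthesize a pure tone as a list of samples in [-amplitude, amplitude]."""
    n = int(duration_s * sample_rate)
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)
            for i in range(n)]


def rms_volume(samples):
    """Volume estimate: root-mean-square amplitude."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def zero_crossing_pitch(samples, sample_rate=8000):
    """Pitch estimate in Hz: a pure tone changes sign twice per cycle."""
    crossings = sum(1 for a, b in zip(samples, samples[1:])
                    if (a < 0) != (b < 0))
    duration_s = len(samples) / sample_rate
    return crossings / (2 * duration_s)


if __name__ == "__main__":
    tone = sine_wave(440.0)
    print(f"volume (RMS): {rms_volume(tone):.3f}")
    print(f"pitch (Hz):   {zero_crossing_pitch(tone):.1f}")
```

Speech and factory sounds mix many frequencies, so practical systems typically rely on spectral analysis instead of zero crossings.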

22
Sample Applications
  • Broadcast content management
  • Call center automation
  • CRM
  • Manufacturing quality control
  • Music retrieval (query by humming)
  • Security
  • etc.

23
Image Mining
  • Analysis of digital images
      • Pictures
      • Drawings
      • Videos
  • Goal: Extract information from images
      • Face recognition
      • Defect detection
      • Object recognition
      • Action/event detection

24
Image Mining
  • Analysis is based on spatial attributes, e.g.,
      • Color
      • Size
      • Texture (macro and micro)
      • Shapes/outlines/shadows
  • Sources of images
      • Digital photographs
      • Surveillance cameras
      • Satellite images
      • Broadcasts
      • etc.
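Two of the spatial attributes listed above, color and size, can be illustrated on a tiny image. This is a minimal sketch that assumes the image is a grid of RGB tuples with a uniform background; the sample image and function names are hypothetical, and texture and shape analysis are not attempted.

```python
from collections import Counter

WHITE = (255, 255, 255)
RED = (255, 0, 0)

# Hypothetical 4x4 image: a red 2x2 object on a white background.
IMAGE = [
    [WHITE, WHITE, WHITE, WHITE],
    [WHITE, RED,   RED,   WHITE],
    [WHITE, RED,   RED,   WHITE],
    [WHITE, WHITE, WHITE, WHITE],
]


def dominant_color(image):
    """Color attribute: the most frequent pixel value."""
    counts = Counter(pixel for row in image for pixel in row)
    return counts.most_common(1)[0][0]


def bounding_box(image, background):
    """Size attribute: (min_x, min_y, max_x, max_y) of all
    non-background pixels, or None if the image is empty."""
    coords = [(x, y)
              for y, row in enumerate(image)
              for x, pixel in enumerate(row)
              if pixel != background]
    if not coords:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs), max(ys))


if __name__ == "__main__":
    background = dominant_color(IMAGE)
    print("background color:", background)
    print("object bounding box:", bounding_box(IMAGE, background))
```

Treating the most frequent color as background is a simplification; photographs and surveillance frames rarely have a uniform background.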

25
Sample Applications
  • Manufacturing quality control
  • Broadcast content management
  • Remote Sensing
  • Security and authentication
  • Forensics
  • Video logs
  • Geophysics
  • Aerial Photogrammetry
  • etc.

26
Unstructured Data Management Issues
  • Metrics
  • Commercial Tools
  • Related Technologies
  • Challenges

27
Metrics
  • Accuracy
      • Percentage of extracted information that is
        correct
  • Thoroughness
      • Percentage of the facts present in the source
        that were extracted
  • Focus
      • Percentage of extracted information that is
        relevant and useful
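The three metrics can be computed directly once the extracted facts are compared against a hand-checked reference. In the sketch below, accuracy as defined here corresponds to what information retrieval calls precision, and thoroughness to recall. The fact sets are hypothetical, and the "relevant" set used for focus is an assumed human judgment.

```python
def accuracy(extracted, truth):
    """Share of extracted facts that are correct (precision)."""
    return len(extracted & truth) / len(extracted) if extracted else 0.0


def thoroughness(extracted, truth):
    """Share of the facts actually present that were extracted (recall)."""
    return len(extracted & truth) / len(truth) if truth else 0.0


def focus(extracted, relevant):
    """Share of extracted facts judged relevant and useful."""
    return len(extracted & relevant) / len(extracted) if extracted else 0.0


# Hypothetical example: facts present in a document, facts a tool
# extracted, and the facts a reviewer considers useful.
TRUTH = {"Canadian Imperial Bank of Commerce", "National Bank of Canada",
         "Canada", "Fiscal 1996"}
EXTRACTED = {"Canadian Imperial Bank of Commerce", "National Bank of Canada",
             "Thursday"}
RELEVANT = {"Canadian Imperial Bank of Commerce"}

if __name__ == "__main__":
    print(f"accuracy:     {accuracy(EXTRACTED, TRUTH):.2f}")
    print(f"thoroughness: {thoroughness(EXTRACTED, TRUTH):.2f}")
    print(f"focus:        {focus(EXTRACTED, RELEVANT):.2f}")
```

Accuracy and thoroughness usually trade off against each other: extracting more aggressively finds more of the facts present but also admits more incorrect ones.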

28
Sample UDM Tools
  • APR/Smartlogik
  • Autonomy
  • Clairvoyance
  • ClearForest
  • Entrieva
  • Insightful
  • MAMI
  • Megaputer

Partial list of products. The author does not
recommend any products.
29
Related Technologies
  • Business Intelligence
  • Knowledge Management
  • Content Management
  • eLearning
  • Collaboration
  • Innovation Management
  • Sales Force Automation
  • Data Mining
  • Visualization

30
Sample Architecture
[Diagram labels: Structured Data, Unstructured Data, Rules, Information Extraction, Data Warehouse, Analysis and Decision Support, Visualization]
31
Challenges
  • Paradigm Shift
  • Which applications? What to analyze?
  • Where's the ROI? What are the risks?
  • Business Readiness
  • Technology Maturity
  • Ambiguity Resolution
  • Uniqueness
  • Timeliness
  • Context
  • Testing
  • Efficacy
  • Adverse Reaction

32
Bibliography
  • Computational Auditory Scene Analysis, edited by
    David F. Rosenthal and Hiroshi G. Okuno
  • Computational Linguistics: An Introduction, by
    Ralph Grishman, Cambridge University Press
  • Elements of Photogrammetry with Applications in
    GIS, by Paul R. Wolf and Bon A. Dewitt
  • Emerging Solutions for Managing Unstructured CRM
    Data, by Richard Peynot (Giga Information Group)
  • Foundations of Statistical Natural Language
    Processing, by Christopher D. Manning and
    Hinrich Schütze
  • Proceedings of the ACM SIGKDD Knowledge Discovery
    and Data Mining Conference, 1999

33
Examples / Demos
34
Examples / Demos
  • EDS - Bank of Knowledge
  • BBC - Neon
  • ClearForest Patent Analysis
  • Insightful Aerial Photogrammetry
  • EDS Securities Fraud Detection
  • Not included in handouts and files

35
Bank of Knowledge
  • EDS Global Purchasing

36
EDS Bank of Knowledge Project
  • Supply chain intelligence
  • EDS has 35,000 active supplier contracts
  • Not possible to read, understand, and utilize all
    the terms in those contracts
  • Revenue and cost reduction opportunities are lost
    because some contractual terms are not known and
    not enforced, e.g.,
      • Discounts offer cost reductions
      • Refunds offer revenue generation

37
Features
  • Modules
      • Spend Management
      • Compliance Management
      • Supplier Intelligence
      • Contracts Management
  • Seamless integration of technologies
      • Text mining
      • Data mining
      • Business intelligence
      • Advanced visualization

38
Sample Contract Attributes
  • Pricing
  • Discounts
  • Margins
  • Pro-rata Refunds
  • Levels of Support
  • License Clauses
  • Contract Amendments
  • Confidentiality
  • Warranty Information
  • Freight Information

BOK correlates the above attributes with other
procurement functions
39
Metrics
  • $4,000 average cost to manually create, execute,
    manage and track a single contract
  • Average $2,000 savings per contract
    reviewed/enforced
  • 3-5% cost savings based on addressable spend
    patterns
  • 12-15% improved productivity
  • ROI in about 6 months of usage

40
BBC Neon
  • APR/Smartlogik

41
BBC Neon
  • NEws information Online
  • A BBC application developed by APR/Smartlogik
  • Robust online news archive solution
  • Concept-based news search engine
  • 5000 users
  • Provides a varied selection of news publications
  • Retention of the BBC's existing taxonomy combined
    with the flexibility to update this structure

Example courtesy of APR and BBC
42
Patent Analysis
  • ClearForest

43
Patent Analysis
  • Licensing
  • Development
  • Asset Evaluation
  • Recruiting opportunities
  • Prosecution
  • Litigation

44
Example Patent Analysis with ClearForest
  • Most referenced patents

45
  • Most active inventors

46
  • Most active companies

47
  • Fuel cell inventors working together

48
  • Link to any patent document, highlighting key
    words