Title: Analysis of Uncertain Data in Text Documents
1 Analysis of Uncertain Data in Text Documents
PAINT
- Carnegie Mellon University and DYNAMiX Technologies
  - PI Jaime G. Carbonell / jgc_at_cs.cmu.edu / (412) 268-7279
  - Co-PI Eugene Fink / e.fink_at_cs.cmu.edu / (412) 268-6593
- HNC and Fair Isaac
  - Co-PI Dayne Freitag / daynefreitag_at_fairisaac.com / (858) 369-8191
  - Co-PI Richard Rohwer / richardrohwer_at_fairisaac.com / (858) 369-8318
2 Proposed functionality
We will integrate the text-extraction system
developed by HNC / Fair Isaac with the
uncertainty-analysis system developed by CMU /
DYNAMiX. The integrated system will support the
following main capabilities.
- Extraction of relevant facts, relations, and causal links from natural-language text documents
- Automated intent inferences and identification of surprising developments based on uncertain data
- Evaluation of given hypotheses
- Proactive information gathering
- Application to the analysis of Iranian nanotechnology plans and capabilities
We will also build an external API for future
integration with other PAINT systems, and
evaluate its effectiveness by implementing an
optional loose integration with the
predictive-analysis system developed by Berkeley
/ LLC.
3 HNC / Fair Isaac REALISM System
Diagram labels:
- Sources: WEB, TEXT DOCUMENTS
- Data acquisition: Real-time IR
- Genre detection
- Unstructured text archive by genre: Academic, Newswire, Blog, ...
- IE models
- Information extraction (entities and relations)
- Basic IE model learning background
- Extracted facts and entities (structured tables)
- Abstract IE model learning background
- Extracted relations and causal links (structured rules)
- Knowledge base: entities, relations, implication pool
- Legend: Background / model-learning data paths; Real-time / modeling data paths
4 HNC / Fair Isaac REALISM System
- Output
  - Large structured tables of relevant facts and entities, which include uncertainty
  - Inference-rule representation of relations and causal links, also including uncertainty
- Input
  - Requirements and filters for the information extraction
  - Natural-language documents
  - World-wide web
5 CMU / DYNAMiX RAPID System
Diagram labels:
- Manual entry, selection, and editing of knowledge
- Prioritized plans for proactive data collection
- Learned inference rules
- RAPID Inference Engine
- RAPID Proactive Planner
- Critical uncertainties
- Inferred facts
- Evaluation of hypotheses
- Query matches
6 CMU / DYNAMiX RAPID System
- Output
  - Inferences from uncertain data
  - New learned inference rules
  - Exact and approximate matches for given queries
  - Hypothesis assessment
  - Proactive plans for collecting additional data
- Input
  - Reality interpretation tables, which represent uncertain facts
  - Uncertain inference rules
  - Queries for specific relevant data
  - Analysts' hypotheses
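To make the idea of applying uncertain inference rules to uncertain facts concrete, here is a minimal sketch in Python. It is not RAPID's actual representation or algorithm, which the slides do not describe. The `Fact` and `Rule` classes, the `apply_rule` function, the example statements, and the choice to multiply probabilities (which assumes the premises are independent) are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Fact:
    """Hypothetical uncertain fact: a statement plus a belief in [0, 1]."""
    statement: str
    probability: float


@dataclass(frozen=True)
class Rule:
    """Hypothetical uncertain rule: if all premises hold, the conclusion
    holds with the given confidence."""
    premises: Tuple[str, ...]
    conclusion: str
    confidence: float


def apply_rule(rule: Rule, facts: Iterable[Fact]) -> Optional[Fact]:
    """Return the inferred fact, or None if any premise is unknown.

    Assumption: premises are treated as independent, so the belief in the
    conclusion is the rule confidence times the product of premise beliefs.
    """
    known = {f.statement: f.probability for f in facts}
    belief = rule.confidence
    for premise in rule.premises:
        if premise not in known:
            return None
        belief *= known[premise]
    return Fact(rule.conclusion, belief)


# Made-up example values, for illustration only.
facts = [
    Fact("facility A acquired equipment E", 0.8),
    Fact("facility A hired specialists in field F", 0.5),
]
rule = Rule(
    premises=("facility A acquired equipment E",
              "facility A hired specialists in field F"),
    conclusion="facility A is starting a program in field F",
    confidence=0.5,
)
inferred = apply_rule(rule, facts)
```

With these made-up numbers the inferred belief is 0.5 × 0.8 × 0.5 = 0.2. A real system would likely need a more careful treatment of dependent evidence than this independence assumption allows.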
7 Integrated system
Diagram labels:
- Manual entry, selection, and editing of knowledge
- Information requests
- Topic filters
- Sources: TEXT DOCUMENTS, WEB
- REALISM (HNC / Fair Isaac)
- RAPID (CMU / DYNAMiX)
- Structured facts and entities
- Structured relations and causal links
- Learned inference rules
- Inferred facts
- Evaluation of hypotheses
- Query matches
- Plans for proactive data collection
- External API
- OTHER PAINT SYSTEMS
- Testing with Berkeley / LLC System
8 Empirical evaluation
Data: We will use public data about Iranian
nanotechnology in the system evaluation. When the
PAINT challenge-problem data about Iran becomes
available, we will combine it with the public
data.
- Component evaluation
  - We will measure the following performance factors:
    - Accuracy and completeness of text extraction
    - Accuracy of hypothesis evaluation
    - Effectiveness of data-collection plans
    - Speed of each system component
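One common way to quantify the accuracy and completeness of text extraction is precision and recall against a hand-labeled reference set. The slides do not say which metrics will be used, so the sketch below is an assumption; the function name `precision_recall` and the sample facts are hypothetical.

```python
from typing import Hashable, Iterable, Tuple


def precision_recall(extracted: Iterable[Hashable],
                     reference: Iterable[Hashable]) -> Tuple[float, float]:
    """Compare extracted items with a hand-labeled reference set.

    Precision (accuracy): fraction of extracted items that are correct.
    Recall (completeness): fraction of reference items that were extracted.
    Empty inputs yield 0.0 rather than a division error.
    """
    extracted_set, reference_set = set(extracted), set(reference)
    correct = len(extracted_set & reference_set)
    precision = correct / len(extracted_set) if extracted_set else 0.0
    recall = correct / len(reference_set) if reference_set else 0.0
    return precision, recall


# Made-up items, for illustration only.
extracted = {"fact-a", "fact-b", "fact-c", "fact-d"}
reference = {"fact-a", "fact-b", "fact-e"}
precision, recall = precision_recall(extracted, reference)
```

Here two of the four extracted items are correct (precision 0.5), and two of the three reference items were found (recall 2/3).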
9 Empirical evaluation
Evaluation of the integrated system: We will
compare the productivity of subjects using the
developed system with that of subjects
who perform the same tasks using off-the-shelf
tools.
- Specific tasks
  - Find data relevant to given hypotheses
  - Evaluate the validity of these hypotheses
  - Identify critical uncertainties and propose a plan for collecting additional relevant data
- Performance measurements
  - Number of tasks completed during the experiment
  - Accuracy of hypothesis evaluation
  - Effectiveness of proactive data-collection plans
Experimental group: Use of REALISM / RAPID
Control group: Use of standard tools
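As an illustration of how the first performance measurement could be compared across the two groups, the sketch below computes the ratio of mean tasks completed. The slides do not specify the analysis method; the function name `productivity_ratio` and the sample counts are hypothetical, and a real study would also need a significance test on a suitably large sample.

```python
from statistics import mean
from typing import Sequence


def productivity_ratio(experimental: Sequence[float],
                       control: Sequence[float]) -> float:
    """Ratio of mean tasks completed: experimental group over control group.

    A value above 1.0 means the experimental group completed more tasks
    on average. Raises ValueError if the control mean is zero.
    """
    control_mean = mean(control)
    if control_mean == 0:
        raise ValueError("control group completed no tasks; ratio undefined")
    return mean(experimental) / control_mean


# Made-up task counts per subject, for illustration only.
experimental_tasks = [6, 8, 7]   # subjects using REALISM / RAPID
control_tasks = [3, 4, 5]        # subjects using standard tools
ratio = productivity_ratio(experimental_tasks, control_tasks)
```

With these made-up counts the means are 7 and 4, so the ratio is 1.75.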
10 Empirical evaluation
Component utility: We will also evaluate the
utility of REALISM and RAPID by comparing the
productivity of subjects under the following
three conditions:
- Use of the integrated system
- Use of REALISM without RAPID
- Use of RAPID without REALISM
11 Schedule