Libraries and Intelligence NSF/NIJ Symposium on Intelligence and Security Informatics. Tucson, AR. - PowerPoint PPT Presentation

About This Presentation

Title:

Libraries and Intelligence NSF/NIJ Symposium on Intelligence and Security Informatics. Tucson, AR.

Description:

Dr. Rafail Ostrovsky, Telcordia Technologies, -algorithms ... Algorithm detects a new class. Human analyst labels it; determines its significance ... – PowerPoint PPT presentation

Number of Views:97

Avg rating:3.0/5.0

Slides: 51

Provided by: Kan143

Learn more at: http://kantor.comminfo.rutgers.edu

Category:

more less

Transcript and Presenter's Notes

Title: Libraries and Intelligence NSF/NIJ Symposium on Intelligence and Security Informatics. Tucson, AR.

1
Libraries and IntelligenceNSF/NIJ Symposium on
Intelligence and Security Informatics. Tucson, AR.

Paul Kantor
June 2, 2003
Research supported in part by the National
Science Foundation under Grant EIA-0087022and by
the Advanced Research Development Activity under
Contract 2002-H790400-000. The
views expressed in this presentation are those of
the author, and do not necessarily represent the
views of the sponsoring agency.

2
Relation to General Intelligence and Security
Informatics

Signal information
Map and image information
Sound/voice information
Geographic information
Structured (Database) information
Free form textual information in machine readable
form

3
Relation to Librarianship

Much of the needed technology for managing
information related to homeland security is of
the same type that librarians have provided by
hand.
But ..
Millions of documents
dozens of languages
many media

4
Librarianship

Cataloging organizing information according to
what it is about
Classification Machine Learning
Use training examples
Adapt as more data is received
Filter huge streams of potentially relevant data
Monitoring Message Streams

5
Librarianship

Reference
Understand what the user wants
Understand both relevance and quality/genre
Learn from a dialog with the user
Intelligent Question Answering

6
Two Projects

Filtering/Monitoring Message Streams National
Science Foundation (NSF) -- acting for the
National Security Agency HITIQA - High quality
interactive Question Answering
Advanced Research Development Activity (ARDA) of
the Intelligence Community

7
OBJECTIVE
Monitor streams of textualized communication to
detect pattern changes and "significant" events

Motivation
monitoring of global satellite communications
(though this may produce voice rather than text)
sniffing and monitoring email traffic

8
(No Transcript)
9
MMS TeamStatisticians, computer scientists,
experts in info. Retrieval library science, etc
Dr. Rafail Ostrovsky, Telcordia Technologies,
-algorithms Prof. Endre Boros, --Boolean
optimization. Dr. Vladimir Menkov programming
Dr. Alex Genkin programming Mr. Andrei
Anghelescu graduate asisstant Mr. Dmitiry
Fradkin graduate assistant

Prof. Fred Roberts decision rules
Prof. David Madigan statistics
Dr. David Lewis text classification
Prof. Paul Kantor info science
Prof. Ilya Muchnik statistics
Prof. Muthu Muthukrishnan algorithms
Dr. Martin Strauss, ATT Labs algorithms

10
TECHNICAL PROBLEM

Given stream of text in any language.
Decide whether "new events" are present in the
flow of messages.
Event new topic or topic with unusual level of
activity.
Retrospective or Supervised Event
Identification Classification into pre-existing
classes.

More Complex Problem Prospective Detection or
Unsupervised Learning
Classes change - new classes or change meaning
A difficult problem in statistics
Recent new CS approaches
Algorithm detects a new class
Human analyst labels it determines its
significance

12
COMPONENTS OF AUTOMATIC MESSAGE PROCESSING

(1). Compression of Text -- to meet storage and
processing limitations
(2). Representation of Text -- put in form
amenable to computation and statistical analysis
(3). Matching Scheme -- computing similarity
between documents
(4). Learning Method -- build on judged examples
to determine characteristics of document cluster
(event)
(5). Fusion Scheme -- combine methods (scores) to
yield improved detection/clustering.

13
Fusion
Learning
Representation
Matching
Compression
Random Projections
Rocchio separator
Discriminant Analysis
Bag of Words
tf-idf
kNN
Boolean Random Projections
Bag of Bits
Naïve Bayes
Support Vector Machines
Boolean
Robust Feature Selection
Sparse Bayes
Non-linear Classifiers
r-NN
Combinatorial Clustering
Project Components Rutgers DIMACS MMS
14
Proposed Advances

Existing methods use some or all 5 automatic
processing components, but dont exploit the full
power of the components and/or an understanding
of how to apply them to text data.
Lewis' methods used an off-the-shelf support
vector machine supervised learner, but tuned it
for frequency properties of the data.Very good
TREC 2002 results on batch learning.
Chinese Academy of Sciences used most basic
linear classifier (Roccho model) and achieved the
best adaptive learning)

15
Proposed Advances II

We can trace a path (called a homotopy) in method
space, from a poor Rocchio model to the CAS one
-- find some better results along the way.
Next steps are
more sophisticated statistical methods
sophisticated data compression in a
pre-processing stage

16
MORE SOPHISTICATED STATISTICAL APPROACHES

Representations Boolean representations
weighting schemes
Matching Schemes Boolean matching nonlinear
transforms of individual feature values
Learning Methods new kernel-based methods
(nonlinear classification) more complex Bayes
classifiers to assign objects to highest
probability class
Fusion Methods combining scores based on ranks,
linear functions, or nonparametric schemes

17
THE APPROACH

Identify best combination of newer methods
through careful exploration of variety of tools.
Address issues of effectiveness (how well task is
done) and efficiency (in computational time and
space)
Use combination of new or modified algorithms and
improved statistical methods built on the
algorithmic primitives.
Systematic Experimentation on components and on
fusion schemes

18
Mercer Kernels
K pos. semi-definite
Mercers Theorem gives necessary and sufficient
conditions for a continuous symmetric function K
to admit this representation Mercer
Kernels This kernel defines a set of functions
HK, elements of which have an expansion
as This set of functions is a reproducing
kernel hilbert space
Prepared by David L. Madigan
19
Support Vector Machine
Two-class classifier with the form parameters
chosen to minimize Many of the fitted ?s are
usually zero xs corresponding the the non-zero
?s are the support vectors.
tuning constant
complexity penalty
Gram matrix
Prepared by David L. Madigan
20
Regularized Linear Feature Space Model
Choose a model of the form to
minimize Solution is finite dimensional
prediction is sign(f(x))
just need to know K, not ? !
A kernel is a function K, such that for all x,z
?X where ? is a mapping from X to an inner
product feature space F
Prepared by David L. Madigan
21
Mixture Models

Pr(dRel)af(d)(1-a)g(d)
f, g may be centered at different points in
document space. So distinct conceptual
representations are accommodated easily.
Examples multinomial distributions.

22
Example Results on Fusion

http//dimacspc6.rutgers.edu/dfradkin/fusion/cent
roid/try.pdf
http//dimacspc6.rutgers.edu/dfradkin/applet/topi
cShowApplet.jsp
60,000 documents.

23
Learning takes place in two spaces For matching
and filtering, we learn rules in the primary
space of document features. For fusion processes
we learn rules in a secondary space of
pseudo-features which are assigned by entire
systems, to incoming documents.
Random Subspace
Feature space
Score space
Relevant
Relevant
24
REFERENCE ASPECT

Effective Communication with the Analyst User

25
HITIQA High-Quality Interactive Question
Answering

University at Albany, SUNY
Rutgers University

26
HITIQA Team

SUNY Albany
Prof. Tomek Strzalkowski, PI/PM
Prof. Rong Tang
Prof. Boris Yamrom, consultant
Ms. Sharon Small, Research Scientist
Mr. Ting Liu, Graduate Student
Mr. Nobuyuki Shimizu, Graduate Student
Mr. Tom Palen, summer intern
Mr. Peter LaMonica, summer intern/AFRL
Rutgers
Prof. Paul Kantor, co-PI
Prof. K.B. Ng
Prof. Nina Wacholder
Mr. Robert Rittman, Graduate Student
Ms. Ying Sun, Graduate Student
Mr. Peng Song, Graduate student

27
HITIQA Concept
28
Key Research Issues

Question Semantics
how the system understands user requests
Human-Computer Dialogue
how the user and the system negotiate this
understanding
Information Quality Metrics
how some information is better than other
Information Fusion
how to assemble the answer that fits user needs.

29
DB

Document Retrieval
Segment/ Filter
Question Processor
question
Cluster Segments

Dialogue Manager
Query Refinement

Build Frames
Gate
answer
Process Frames
Answer Generator
Visualization
Wordnet
Current Focus

Completed Work

30
Data-Driven NL Semantics

What does the question mean to the user?
The speech act
The focus
Users task, intention, goal
Users background knowledge

What does the question mean to the system?
Available information
Information that can be retrieved
The dimensions of the retrieved information

User Semantics
System Semantics
31
Answer Space Topology
ALL RETRIEVED FRAMES
KERNEL QUESTION MATCH
NEAR MISSES, ALTERNATIVE INTERPRETATIONS
32
Quality Judgments

Focus Group
Sessions conducted March-April, 2002
Results Nine quality aspects generated
Expert Sessions
Sessions Conducted May-June, 2002
Results 100 documents scored twice along 9
quality aspects
Student Sessions
Training and Testing Sessions June-July, 2002
10 documents judged by experts used for
training/testing
Actual Judgment Sessions June-August, 2002
Qualified students evaluated 10 documents per
session
Results 900 documents scored twice along 9
quality aspects

33
Factor Analysis of 9 Quality Features
34
Modeling Quality of Text

Kitchen sink approach
160 independent variables
Part-of-speech, vocabulary
stylistics, named entities,
Statistical pruning
Statistically significant variables
May be nonsensical to human
Human pruning
Only sensible variables retained for each
quality
Pruning improves performance
Kitchen sink overfits
Statistics and Human close in performance
More work needed to understand the relationship

35
Performance of models
36
(No Transcript)
37
In Summary

The two conceptual foundations of librarianship
cataloging and reference, translate to two
important problems in managing streams of textual
messages
Both involve pattern recognition or machine
learning.

38
Two Roles for Learning

Cataloging learning which features of a message
mean that it is significant to the problem at
hand
Reference learning which features of a message
mean that it is salient to a specific user of
the system.

39
AppendixThe following slides were not presented
at the conference.
40
Communicating Credibility

A system that is correct 75 or 80 of the time
will be wrong one time in every four or 5.
Unless it can shade its judgments or
recommendations, the analyst will lose confidence
in it.
Credibility must be high enough to avoid
extensive rework.

41
Data Fusion

Use multiple methods to assess the relevance of
documents or passages,
For a given question, dialogue, or cluster
Each method assigns a score
Candidates ? points in a score space
Seek patterns to localize the most relevant
documents or passages in this score space
Developed interactive data analysis tool

42
Background on Fusion Problem

There are systems S, T, U,
There are problems to be solved P,Q,R
This defines several fusion problems
Local fusion for a given problem P, and a pair
of systems S,T, what is the best fusion rule
Let s(d) ,t(d) be the scores assigned to
document d by systems S and T. Fusion tries to
find the best combining function f(s,t)

43
Non-linear iso-relevance
44
Local Fusion Rule

A local fusion rule fP(s,t) depends on the
specific problem P.
This is relevant if P represents a static problem
or profile, which will be considered on many
occasions
A global fusion rule f(s,t) does not depend on a
specific problem P,
and can be safely used on a variety of problems.

45
Local Fusion Results are Good

Completely rigorous For each topic
1) Randomly split the documents into two parts
training and testing
2) Do the logistic regression on training part
and get the fusion scores for both training and
testing documents
3) Calculate p_100 on testing documents.
4) Excellent results (one random sample for each)
5) Test SMART and InQuery on the same random
testing set

46
Summary of Local Fusion
PROBLEM CASE We ran 5 split half runs on the odd
case (318) and the results persist.
47
Is Local Sensible?

Local fusion depends on getting information about
a particular topic, and doing the best possible
fusion.
Not available in an AdHoc (e.g. Google) setting
Potentially available in an intelligence
applications - -filtering standing profile

48
(No Transcript)
49
(No Transcript)
50
Our Approach to Retrieval Fusion
Adaptive Local Fusion
Request
SMART
Result Set
USE Better System
InQuery
Result Set
DOCUMENTS
SETS
FUSION
Monitor Fusion Set and Receive Feedback
PROCESS
Delivered SET
ADOPT Fusion System

Write a Comment

User Comments (0)