Title: Libraries and Intelligence. NSF/NIJ Symposium on Intelligence and Security Informatics. Tucson, AZ.
1Libraries and Intelligence: NSF/NIJ Symposium on Intelligence and Security Informatics. Tucson, AZ.
- Paul Kantor
- June 2, 2003
- Research supported in part by the National Science Foundation under Grant EIA-0087022 and by the Advanced Research and Development Activity under Contract 2002-H790400-000.
- The views expressed in this presentation are those of the author, and do not necessarily represent the views of the sponsoring agencies.
2Relation to General Intelligence and Security
Informatics
- Signal information
- Map and image information
- Sound/voice information
- Geographic information
- Structured (Database) information
- Free form textual information in machine readable
form
3Relation to Librarianship
- Much of the needed technology for managing information related to homeland security is of the same type that librarians have provided by hand.
- But ...
- Millions of documents
- dozens of languages
- many media
4Librarianship
- Cataloging: organizing information according to what it is about
- Classification: Machine Learning
- Use training examples
- Adapt as more data is received
- Filter huge streams of potentially relevant data
- Monitoring Message Streams
5Librarianship
- Reference
- Understand what the user wants
- Understand both relevance and quality/genre
- Learn from a dialog with the user
- Intelligent Question Answering
6Two Projects
- Filtering/Monitoring Message Streams: National Science Foundation (NSF), acting for the National Security Agency
- HITIQA (High-Quality Interactive Question Answering): Advanced Research and Development Activity (ARDA) of the Intelligence Community
7OBJECTIVE
Monitor streams of textualized communication to
detect pattern changes and "significant" events
- Motivation
- monitoring of global satellite communications (though this may produce voice rather than text)
- sniffing and monitoring email traffic
8(No Transcript)
9MMS Team
Statisticians, computer scientists, experts in information retrieval, library science, etc.
- Dr. Rafail Ostrovsky, Telcordia Technologies: algorithms
- Prof. Endre Boros: Boolean optimization
- Dr. Vladimir Menkov: programming
- Dr. Alex Genkin: programming
- Mr. Andrei Anghelescu: graduate assistant
- Mr. Dmitriy Fradkin: graduate assistant
- Prof. Fred Roberts: decision rules
- Prof. David Madigan: statistics
- Dr. David Lewis: text classification
- Prof. Paul Kantor: information science
- Prof. Ilya Muchnik: statistics
- Prof. Muthu Muthukrishnan: algorithms
- Dr. Martin Strauss, AT&T Labs: algorithms
10TECHNICAL PROBLEM
- Given a stream of text in any language.
- Decide whether "new events" are present in the flow of messages.
- Event: a new topic, or a topic with an unusual level of activity.
- Retrospective or Supervised Event Identification: classification into pre-existing classes.
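One simple way to make "a topic with an unusual level of activity" concrete is to compare each period's message count for a topic against its recent history. The sketch below is only an illustration, not the MMS project's method. The function name `unusual_activity`, the trailing window, the z-score threshold, and the `min_sigma` floor are all assumptions introduced here.

```python
from statistics import mean, stdev


def unusual_activity(counts, window=7, threshold=3.0, min_sigma=1.0):
    """Return indices of periods whose message count is unusually high.

    counts    -- messages per period for one topic (hypothetical input)
    window    -- number of preceding periods used as the baseline
    threshold -- standard deviations above the baseline mean that count as unusual
    min_sigma -- floor on the standard deviation, so that a nearly flat
                 baseline does not make every small change look extreme
    """
    flagged = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu = mean(history)
        sigma = max(stdev(history), min_sigma)
        if (counts[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Hypothetical daily counts for one topic; day 8 shows a burst.
daily_counts = [4, 5, 6, 5, 4, 5, 6, 5, 30, 5]
print(unusual_activity(daily_counts))
```

A trailing window keeps the baseline local, so a burst is judged against recent traffic and not against the whole history of the stream.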
11- More Complex Problem: Prospective Detection or Unsupervised Learning
- Classes change: new classes appear, or existing classes change meaning
- A difficult problem in statistics
- Recent new CS approaches
- Algorithm detects a new class
- Human analyst labels it and determines its significance
12COMPONENTS OF AUTOMATIC MESSAGE PROCESSING
- (1) Compression of Text -- to meet storage and processing limitations
- (2) Representation of Text -- put in a form amenable to computation and statistical analysis
- (3) Matching Scheme -- computing similarity between documents
- (4) Learning Method -- build on judged examples to determine characteristics of a document cluster ("event")
- (5) Fusion Scheme -- combine methods (scores) to yield improved detection/clustering.
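Components (2) and (3) can be illustrated with a bag-of-words tf-idf representation and cosine matching. This is a minimal sketch under assumptions made here (whitespace tokenization, raw term frequency times log inverse document frequency, and the function names); the project's actual representations and matching schemes may differ.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Represent each document as a sparse {term: weight} dictionary."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors (0.0 if either is all zero)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical messages, for illustration only.
messages = [
    "satellite link intercepted near border",
    "border satellite link intercepted again",
    "library catalog update notice",
]
vecs = tfidf_vectors(messages)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

The first two messages share several weighted terms and so score higher than the unrelated third message, which shares none.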
13Project Components: Rutgers DIMACS MMS
(Diagram) Component columns: Compression, Representation, Matching, Learning, Fusion.
Methods shown: Random Projections, Rocchio separator, Discriminant Analysis, Bag of Words, tf-idf, kNN, Boolean Random Projections, Bag of Bits, Naïve Bayes, Support Vector Machines, Boolean, Robust Feature Selection, Sparse Bayes, Non-linear Classifiers, r-NN, Combinatorial Clustering.
14Proposed Advances
- Existing methods use some or all 5 automatic processing components, but don't exploit the full power of the components and/or an understanding of how to apply them to text data.
- Lewis' methods used an off-the-shelf support vector machine supervised learner, but tuned it for frequency properties of the data. Very good TREC 2002 results on batch learning.
- Chinese Academy of Sciences used the most basic linear classifier (Rocchio model) and achieved the best adaptive learning results.
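For readers unfamiliar with the Rocchio model mentioned above, the following is a textbook-style sketch of a Rocchio linear classifier: the profile is a weighted difference between the centroid of relevant examples and the centroid of non-relevant ones. It is not the Chinese Academy of Sciences system; the weights `beta` and `gamma`, the raw term-count vectors, and all function names are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def bag(text):
    """Bag-of-words term counts for one message."""
    return Counter(text.lower().split())


def centroid(vectors):
    """Average of sparse term vectors (empty dict if there are none)."""
    if not vectors:
        return {}
    total = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def rocchio_profile(relevant, nonrelevant, beta=1.0, gamma=0.5):
    """Profile = beta * centroid(relevant) - gamma * centroid(nonrelevant)."""
    pos, neg = centroid(relevant), centroid(nonrelevant)
    return {t: beta * pos.get(t, 0.0) - gamma * neg.get(t, 0.0)
            for t in set(pos) | set(neg)}


def score(profile, vec):
    """Linear score of a document against the profile; higher means more relevant."""
    return sum(weight * profile.get(term, 0.0) for term, weight in vec.items())


# Hypothetical judged examples.
relevant = [bag("bomb threat airport"), bag("airport threat warning")]
nonrelevant = [bag("library book sale"), bag("book club meeting")]
profile = rocchio_profile(relevant, nonrelevant)
print(score(profile, bag("threat at airport")), score(profile, bag("book sale")))
```

In an adaptive setting the profile would be recomputed as new judged messages arrive, which is what makes this simple model cheap to update.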
15Proposed Advances II
- We can trace a path (called a homotopy) in method space, from a poor Rocchio model to the CAS one -- find some better results along the way.
- Next steps are
- more sophisticated statistical methods
- sophisticated data compression in a
pre-processing stage
16MORE SOPHISTICATED STATISTICAL APPROACHES
- Representations: Boolean representations; weighting schemes
- Matching Schemes: Boolean matching; nonlinear transforms of individual feature values
- Learning Methods: new kernel-based methods (nonlinear classification); more complex Bayes classifiers to assign objects to the highest probability class
- Fusion Methods: combining scores based on ranks, linear functions, or nonparametric schemes
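As an illustration of combining scores "based on ranks", the sketch below orders documents by their average rank across several systems. It is one simple rank-based scheme chosen here for illustration, not necessarily the one the project used. It assumes every system has scored the same set of documents; the function names and sample scores are hypothetical.

```python
def rank_positions(scores):
    """Map each document id to its rank (1 = best) under one system's scores."""
    ordered = sorted(scores, key=lambda doc: (-scores[doc], doc))
    return {doc: position for position, doc in enumerate(ordered, start=1)}


def mean_rank_fusion(systems):
    """Order documents by average rank across systems (lower average is better).

    systems -- list of {doc_id: score} dictionaries over the same documents
    """
    ranks = [rank_positions(scores) for scores in systems]
    docs = set(systems[0])
    average = {doc: sum(r[doc] for r in ranks) / len(ranks) for doc in docs}
    # Ties in average rank are broken by document id to keep the output stable.
    return sorted(docs, key=lambda doc: (average[doc], doc))


# Three hypothetical systems with scores on different scales.
s = {"d1": 0.9, "d2": 0.5, "d3": 0.1}
t = {"d1": 2.0, "d2": 7.0, "d3": 1.0}
u = {"d1": 3.0, "d2": 9.0, "d3": 1.0}
print(mean_rank_fusion([s, t, u]))
```

Working with ranks sidesteps the problem that different systems report scores on incomparable scales.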
17THE APPROACH
- Identify the best combination of newer methods through careful exploration of a variety of tools.
- Address issues of effectiveness (how well the task is done) and efficiency (in computational time and space).
- Use a combination of new or modified algorithms and improved statistical methods built on the algorithmic primitives.
- Systematic experimentation on components and on fusion schemes.
18Mercer Kernels
K is positive semi-definite.
Mercer's Theorem gives necessary and sufficient conditions for a continuous symmetric function K to admit this representation (formula not in transcript). Such kernels are "Mercer Kernels".
This kernel defines a set of functions H_K, elements of which have an expansion (formula not in transcript).
This set of functions is a "reproducing kernel Hilbert space".
Prepared by David L. Madigan
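The formulas on this slide did not survive transcription. For orientation, the standard textbook form of a Mercer expansion and of the associated function space is sketched below. The symbols \gamma_i, \phi_i, and c_i are assumptions chosen here and may differ from the slide's notation.

```latex
% Eigen-expansion of a positive semi-definite kernel (standard textbook form)
\[
K(x, z) = \sum_{i=1}^{\infty} \gamma_i \, \phi_i(x) \, \phi_i(z),
\qquad \gamma_i \ge 0 .
\]
% Elements of H_K are expansions in the same eigenfunctions, with finite norm
\[
f(x) = \sum_{i=1}^{\infty} c_i \, \phi_i(x),
\qquad
\|f\|_{H_K}^2 = \sum_{i=1}^{\infty} \frac{c_i^2}{\gamma_i} < \infty .
\]
```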
19Support Vector Machine
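Two-class classifier with the form (formula not in transcript), with parameters chosen to minimize (formula not in transcript).
Many of the fitted coefficients are usually zero; the x's corresponding to the non-zero coefficients are the "support vectors".
Annotations on the objective: tuning constant, complexity penalty, Gram matrix.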
Prepared by David L. Madigan
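The slide's formulas are missing from the transcript. One common way to write a kernel support vector machine that matches the three annotations (tuning constant, complexity penalty, Gram matrix) is sketched below. The coefficient symbol \alpha is an assumption, and conventions differ on constant factors in front of the penalty.

```latex
% Classifier: a kernel expansion over the n training points
\[
f(x) = \alpha_0 + \sum_{i=1}^{n} \alpha_i \, K(x, x_i)
\]
% Objective: hinge loss plus a penalty on the expansion coefficients
\[
\min_{\alpha_0, \alpha} \;
\sum_{i=1}^{n} \bigl[ 1 - y_i f(x_i) \bigr]_{+}
\; + \; \lambda \, \alpha^{\top} \mathbf{K} \, \alpha
\]
% \lambda: tuning constant
% \alpha^{\top} \mathbf{K} \alpha: complexity penalty
% \mathbf{K}: Gram matrix with entries \mathbf{K}_{ij} = K(x_i, x_j)
```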
20Regularized Linear Feature Space Model
Choose a model of the form (formula not in transcript) to minimize (formula not in transcript).
Solution is finite dimensional; prediction is sign(f(x)).
Just need to know K, not φ!
A kernel is a function K such that for all x, z ∈ X, K(x, z) = ⟨φ(x), φ(z)⟩, where φ is a mapping from X to an inner product feature space F.
Prepared by David L. Madigan
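The remark that only K is needed, not the feature map, can be checked numerically. The sketch below uses the quadratic kernel K(x, z) = (x · z)^2 on 2-D inputs, for which an explicit feature map is known, and shows a prediction of the form sign(sum_i alpha_i K(x_i, x)). The kernel choice, the coefficients, and the function names are illustrative assumptions, not taken from the slide.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def quadratic_kernel(x, z):
    """K(x, z) = (x . z)^2, computed directly from the inputs."""
    return dot(x, z) ** 2


def feature_map(x):
    """Explicit map for 2-D inputs whose inner product equals quadratic_kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def predict(alphas, points, x, kernel=quadratic_kernel):
    """Return sign(f(x)) for f(x) = sum_i alpha_i * K(x_i, x); ties go to -1."""
    f = sum(a * kernel(p, x) for a, p in zip(alphas, points))
    return 1 if f > 0 else -1


x, z = (1.0, 2.0), (3.0, -1.0)
# Both routes give the same number; the kernel never builds the feature vectors.
print(quadratic_kernel(x, z), dot(feature_map(x), feature_map(z)))
# Hypothetical coefficients and expansion points.
print(predict([1.0, -1.0], [(1.0, 0.0), (0.0, 1.0)], (2.0, 0.5)))
```

For kernels whose feature space is very large or infinite dimensional, evaluating K directly is the only practical route.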
21Mixture Models
- Pr(d | Rel) = a f(d) + (1 - a) g(d)
- f, g may be centered at different points in document space, so distinct conceptual representations are accommodated easily.
- Examples: multinomial distributions.
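A two-component mixture with multinomial components can be evaluated as follows. This is a minimal sketch: the vocabulary, the term probabilities for f and g, and the mixing weight a are made-up values, and it assumes every term in the document appears in both components' vocabularies.

```python
import math
from collections import Counter


def multinomial_log_prob(counts, probs):
    """Log-probability of a term-count vector under a multinomial distribution.

    counts -- {term: count} for one document
    probs  -- {term: probability} over a closed vocabulary (assumed to sum to 1)
    """
    n = sum(counts.values())
    log_p = math.lgamma(n + 1)  # log n!
    for term, c in counts.items():
        log_p += c * math.log(probs[term]) - math.lgamma(c + 1)
    return log_p


def mixture_prob(counts, a, f_probs, g_probs):
    """Pr(d | Rel) = a * f(d) + (1 - a) * g(d)."""
    f_d = math.exp(multinomial_log_prob(counts, f_probs))
    g_d = math.exp(multinomial_log_prob(counts, g_probs))
    return a * f_d + (1 - a) * g_d


# Two hypothetical "centers" in document space for the same relevant class.
f_probs = {"attack": 0.6, "harbor": 0.3, "market": 0.1}
g_probs = {"attack": 0.1, "harbor": 0.1, "market": 0.8}
doc = Counter({"attack": 2, "harbor": 1})
print(mixture_prob(doc, a=0.7, f_probs=f_probs, g_probs=g_probs))
```

Because f and g put their mass on different terms, the mixture can give reasonable probability to relevant documents that use either vocabulary.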
22Example Results on Fusion
- http://dimacspc6.rutgers.edu/dfradkin/fusion/centroid/try.pdf
- http://dimacspc6.rutgers.edu/dfradkin/applet/topicShowApplet.jsp
- 60,000 documents.
23Learning takes place in two spaces
For matching and filtering, we learn rules in the primary space of document features. For fusion processes we learn rules in a secondary space of pseudo-features, which are assigned by entire systems to incoming documents.
(Diagram labels: Random Subspace, Feature space, Score space, Relevant, Relevant)
24REFERENCE ASPECT
- Effective Communication with the Analyst User
25HITIQA High-Quality Interactive Question
Answering
- University at Albany, SUNY
- Rutgers University
26HITIQA Team
- SUNY Albany
- Prof. Tomek Strzalkowski, PI/PM
- Prof. Rong Tang
- Prof. Boris Yamrom, consultant
- Ms. Sharon Small, Research Scientist
- Mr. Ting Liu, Graduate Student
- Mr. Nobuyuki Shimizu, Graduate Student
- Mr. Tom Palen, summer intern
- Mr. Peter LaMonica, summer intern/AFRL
- Rutgers
- Prof. Paul Kantor, co-PI
- Prof. K.B. Ng
- Prof. Nina Wacholder
- Mr. Robert Rittman, Graduate Student
- Ms. Ying Sun, Graduate Student
- Mr. Peng Song, Graduate student
27HITIQA Concept
28Key Research Issues
- Question Semantics
- how the system understands user requests
- Human-Computer Dialogue
- how the user and the system negotiate this understanding
- Information Quality Metrics
- how some information is better than other information
- Information Fusion
- how to assemble the answer that fits user needs.
29(Diagram of system components)
Components shown: DB, Document Retrieval, Segment/Filter, Question Processor, Cluster Segments, Dialogue Manager, Query Refinement, Build Frames, Gate, Process Frames, Answer Generator, Visualization, WordNet.
Flow labels: question, answer. Legend: Current Focus, Completed Work.
30Data-Driven NL Semantics
- What does the question mean to the user?
- The speech act
- The focus
- User's task, intention, goal
- User's background knowledge
- What does the question mean to the system?
- Available information
- Information that can be retrieved
- The dimensions of the retrieved information
User Semantics
System Semantics
31Answer Space Topology
ALL RETRIEVED FRAMES
KERNEL QUESTION MATCH
NEAR MISSES, ALTERNATIVE INTERPRETATIONS
32Quality Judgments
- Focus Group
- Sessions conducted March-April, 2002
- Results: Nine quality aspects generated
- Expert Sessions
- Sessions conducted May-June, 2002
- Results: 100 documents scored twice along 9 quality aspects
- Student Sessions
- Training and Testing Sessions: June-July, 2002
- 10 documents judged by experts used for training/testing
- Actual Judgment Sessions: June-August, 2002
- Qualified students evaluated 10 documents per session
- Results: 900 documents scored twice along 9 quality aspects
33Factor Analysis of 9 Quality Features
34Modeling Quality of Text
- Kitchen sink approach
- 160 independent variables
- Part-of-speech, vocabulary
- stylistics, named entities,
- Statistical pruning
- Statistically significant variables
- May be nonsensical to a human
- Human pruning
- Only sensible variables retained for each quality
- Pruning improves performance
- Kitchen sink overfits
- Statistics and Human close in performance
- More work needed to understand the relationship
35Performance of models
36(No Transcript)
37In Summary
- The two conceptual foundations of librarianship, cataloging and reference, translate to two important problems in managing streams of textual messages.
- Both involve pattern recognition or machine learning.
38Two Roles for Learning
- Cataloging: learning which features of a message mean that it is significant to the problem at hand.
- Reference: learning which features of a message mean that it is salient to a specific user of the system.
39Appendix
The following slides were not presented at the conference.
40Communicating Credibility
- A system that is correct 75% or 80% of the time will be wrong one time in every four or five.
- Unless it can shade its judgments or recommendations, the analyst will lose confidence in it.
- Credibility must be high enough to avoid extensive rework.
41Data Fusion
- Use multiple methods to assess the relevance of documents or passages
- For a given question, dialogue, or cluster
- Each method assigns a score
- Candidates become points in a score space
- Seek patterns to localize the most relevant documents or passages in this score space
- Developed interactive data analysis tool
42Background on Fusion Problem
- There are systems S, T, U, ...
- There are problems to be solved P, Q, R, ...
- This defines several fusion problems
- Local fusion: for a given problem P, and a pair of systems S, T, what is the best fusion rule?
- Let s(d), t(d) be the scores assigned to document d by systems S and T. Fusion tries to find the best combining function f(s,t).
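To make the search for a combining function f(s,t) concrete, the sketch below restricts f to linear mixtures w*s + (1 - w)*t and picks the weight that gives the best precision in the top k on documents with known relevance. The restriction to linear rules, the grid search, the precision-at-k criterion, and the sample scores are all assumptions for illustration; the slides also show non-linear rules.

```python
def precision_at_k(ranked_ids, relevant, k):
    """Fraction of the top k ranked documents that are relevant."""
    return sum(1 for doc in ranked_ids[:k] if doc in relevant) / k


def fuse(s, t, w):
    """Linear combining function f(s, t) = w * s + (1 - w) * t, per document."""
    return {doc: w * s[doc] + (1 - w) * t[doc] for doc in s}


def best_linear_weight(s, t, relevant, k, steps=10):
    """Grid-search w in [0, 1]; return (w, precision) for the first best weight."""
    best_w, best_p = 0.0, -1.0
    for i in range(steps + 1):
        w = i / steps
        fused = fuse(s, t, w)
        ranked = sorted(fused, key=lambda doc: (-fused[doc], doc))
        p = precision_at_k(ranked, relevant, k)
        if p > best_p:
            best_w, best_p = w, p
    return best_w, best_p


# Hypothetical scores: each system alone places one non-relevant document in the top 2.
s = {"a": 0.9, "b": 0.5, "c": 0.8, "d": 0.1}
t = {"a": 0.5, "b": 0.9, "c": 0.1, "d": 0.8}
relevant = {"a", "b"}
print(best_linear_weight(s, t, relevant, k=2))
```

In this toy case neither system alone ranks both relevant documents in the top two, while an intermediate weight does, which is the situation where fusion pays off.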
43Non-linear iso-relevance
44Local Fusion Rule
- A local fusion rule f_P(s,t) depends on the specific problem P.
- This is relevant if P represents a static problem or profile, which will be considered on many occasions.
- A global fusion rule f(s,t) does not depend on a specific problem P, and can be safely used on a variety of problems.
45Local Fusion Results are Good
- Completely rigorous. For each topic:
- 1) Randomly split the documents into two parts: training and testing
- 2) Do the logistic regression on the training part and get the fusion scores for both training and testing documents
- 3) Calculate p_100 on testing documents.
- 4) Excellent results (one random sample for each)
- 5) Test SMART and InQuery on the same random testing set
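The procedure above can be sketched end to end for one topic. This is an illustrative reconstruction, not the authors' code: it reads p_100 as precision in the top 100 (here a general `k`), fits the logistic regression with plain gradient ascent, and uses synthetic scores in place of real SMART and InQuery output. All names and parameter values are assumptions.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, labels, lr=0.5, epochs=500):
    """Fit P(relevant | s, t) = sigmoid(b + w_s*s + w_t*t) by batch gradient ascent."""
    w_s = w_t = b = 0.0
    n = len(rows)
    for _ in range(epochs):
        g_s = g_t = g_b = 0.0
        for (s, t), y in zip(rows, labels):
            err = y - sigmoid(b + w_s * s + w_t * t)
            g_s += err * s
            g_t += err * t
            g_b += err
        w_s += lr * g_s / n
        w_t += lr * g_t / n
        b += lr * g_b / n
    return w_s, w_t, b


def precision_at_k(ranked_labels, k):
    """Fraction of relevant (label 1) items among the first k ranked labels."""
    top = ranked_labels[:k]
    return sum(top) / len(top) if top else 0.0


def local_fusion_run(docs, k, rng):
    """docs: list of (s, t, label). Returns precision at k on the held-out half."""
    shuffled = docs[:]
    rng.shuffle(shuffled)                      # step 1: random split
    half = len(shuffled) // 2
    train, test = shuffled[:half], shuffled[half:]
    w_s, w_t, b = fit_logistic([(s, t) for s, t, _ in train],
                               [y for _, _, y in train])   # step 2: fit on training half
    ranked = sorted(test, key=lambda d: -(b + w_s * d[0] + w_t * d[1]))
    return precision_at_k([y for _, _, y in ranked], k)    # step 3: score testing half


# Synthetic topic: relevant documents tend to score higher on both systems.
rng = random.Random(0)
docs = []
for _ in range(200):
    label = 1 if rng.random() < 0.3 else 0
    center = 0.6 if label else 0.4
    docs.append((rng.gauss(center, 0.15), rng.gauss(center, 0.15), label))
print(local_fusion_run(docs, k=20, rng=rng))
```

Step 5 of the procedure would rank the same held-out half by each single system's score and compare its precision with the fused result.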
46Summary of Local Fusion
PROBLEM CASE: We ran 5 split-half runs on the odd case (318) and the results persist.
47Is Local Sensible?
- Local fusion depends on getting information about a particular topic, and doing the best possible fusion.
- Not available in an ad hoc (e.g. Google) setting
- Potentially available in intelligence applications
- filtering, standing profile
48(No Transcript)
49(No Transcript)
50Our Approach to Retrieval Fusion: Adaptive Local Fusion
(Flow diagram) Labels: Request; SMART, Result Set; InQuery, Result Set; USE Better System; DOCUMENTS SETS; FUSION PROCESS; Monitor Fusion Set and Receive Feedback; Delivered SET; ADOPT Fusion System.