Libraries and Intelligence. NSF/NIJ Symposium on Intelligence and Security Informatics. Tucson, AZ.


1
Libraries and Intelligence
NSF/NIJ Symposium on Intelligence and Security Informatics. Tucson, AZ.
  • Paul Kantor
  • June 2, 2003
  • Research supported in part by the National
    Science Foundation under Grant EIA-0087022 and by
    the Advanced Research and Development Activity under
    Contract 2002-H790400-000.
  • The views expressed in this presentation are those of
    the author, and do not necessarily represent the
    views of the sponsoring agency.

2
Relation to General Intelligence and Security
Informatics
  • Signal information
  • Map and image information
  • Sound/voice information
  • Geographic information
  • Structured (Database) information
  • Free form textual information in machine readable
    form

3
Relation to Librarianship
  • Much of the needed technology for managing
    information related to homeland security is of
    the same type that librarians have provided by
    hand.
  • But ...
  • Millions of documents
  • dozens of languages
  • many media

4
Librarianship
  • Cataloging: organizing information according to
    what it is about
  • Classification: Machine Learning
  • Use training examples
  • Adapt as more data is received
  • Filter huge streams of potentially relevant data
  • Monitoring Message Streams

5
Librarianship
  • Reference
  • Understand what the user wants
  • Understand both relevance and quality/genre
  • Learn from a dialog with the user
  • Intelligent Question Answering

6
Two Projects
  • Filtering/Monitoring Message Streams: National
    Science Foundation (NSF), acting for the
    National Security Agency
  • HITIQA (High-Quality Interactive Question Answering):
    Advanced Research and Development Activity (ARDA) of
    the Intelligence Community

7
OBJECTIVE
Monitor streams of textualized communication to
detect pattern changes and "significant" events
  • Motivation
  • monitoring of global satellite communications
    (though this may produce voice rather than text)
  • sniffing and monitoring email traffic

9
MMS Team
Statisticians, computer scientists, experts in information retrieval, library science, etc.
  • Prof. Fred Roberts: decision rules
  • Prof. David Madigan: statistics
  • Dr. David Lewis: text classification
  • Prof. Paul Kantor: info science
  • Prof. Ilya Muchnik: statistics
  • Prof. Muthu Muthukrishnan: algorithms
  • Dr. Martin Strauss, AT&T Labs: algorithms
  • Dr. Rafail Ostrovsky, Telcordia Technologies: algorithms
  • Prof. Endre Boros: Boolean optimization
  • Dr. Vladimir Menkov: programming
  • Dr. Alex Genkin: programming
  • Mr. Andrei Anghelescu: graduate assistant
  • Mr. Dmitriy Fradkin: graduate assistant

10
TECHNICAL PROBLEM
  • Given stream of text in any language.
  • Decide whether "new events" are present in the
    flow of messages.
  • Event: a new topic, or a topic with an unusual level of
    activity.
  • Retrospective or Supervised Event
    Identification: classification into pre-existing
    classes.

11
  • More Complex Problem: Prospective Detection or
    Unsupervised Learning
  • Classes change: new classes appear, or existing
    classes change meaning
  • A difficult problem in statistics
  • Recent new CS approaches
  • Algorithm detects a new class
  • Human analyst labels it; determines its
    significance

12
COMPONENTS OF AUTOMATIC MESSAGE PROCESSING
  • (1). Compression of Text -- to meet storage and
    processing limitations
  • (2). Representation of Text -- put in form
    amenable to computation and statistical analysis
  • (3). Matching Scheme -- computing similarity
    between documents
  • (4). Learning Method -- build on judged examples
    to determine characteristics of document cluster
    (event)
  • (5). Fusion Scheme -- combine methods (scores) to
    yield improved detection/clustering.
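As a rough illustration of components (2) and (3) above, the following minimal sketch builds a bag-of-words tf-idf representation and matches documents by cosine similarity. The whitespace tokenizer, the function names, and the three toy messages are assumptions made for illustration; they are not taken from the MMS project.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase whitespace tokenizer; a stand-in for real text processing."""
    return text.lower().split()

def tfidf_vectors(docs):
    """Represent each document as a sparse {term: tf * idf} dictionary."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append(
            {t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        )
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors (0.0 if either is all zero)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy messages.
docs = [
    "shipment arrives at the port tonight",
    "the shipment leaves the port tomorrow",
    "library catalog updated with new titles",
]
vecs = tfidf_vectors(docs)
sim_related = cosine(vecs[0], vecs[1])    # messages sharing several terms
sim_unrelated = cosine(vecs[0], vecs[2])  # messages sharing no terms
print(sim_related > sim_unrelated)
```

A learning method (component 4) would then be trained on such vectors, and a fusion scheme (component 5) would combine the scores of several matchers.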

13
Project Components (Rutgers DIMACS MMS)
  • Component areas: Compression, Representation,
    Matching, Learning, Fusion
  • Methods shown: Random Projections, Boolean Random
    Projections, Bag of Words, Bag of Bits, tf-idf,
    Boolean, Robust Feature Selection, Rocchio
    separator, Discriminant Analysis, kNN, r-NN,
    Naïve Bayes, Sparse Bayes, Support Vector
    Machines, Non-linear Classifiers, Combinatorial
    Clustering
14
Proposed Advances
  • Existing methods use some or all 5 automatic
    processing components, but don't exploit the full
    power of the components and/or an understanding
    of how to apply them to text data.
  • Lewis' methods used an off-the-shelf support
    vector machine supervised learner, but tuned it
    for frequency properties of the data. Very good
    TREC 2002 results on batch learning.
  • Chinese Academy of Sciences used the most basic
    linear classifier (Rocchio model) and achieved the
    best adaptive learning.

15
Proposed Advances II
  • We can trace a path (called a homotopy) in method
    space, from a poor Rocchio model to the Chinese
    Academy of Sciences (CAS) one, and find some
    better results along the way.
  • Next steps are:
  • more sophisticated statistical methods
  • sophisticated data compression in a
    pre-processing stage
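The idea of tracing a path in method space can be sketched as follows. This is only an illustration under stated assumptions: the Rocchio prototype is taken as beta times the mean relevant vector minus gamma times the mean non-relevant vector, and the path is a straight line between two (beta, gamma) settings. The endpoints, the toy documents, and all function names are hypothetical; the actual CAS settings are not given in the slides.

```python
def rocchio_prototype(pos_docs, neg_docs, beta, gamma):
    """Prototype = beta * mean(relevant) - gamma * mean(non-relevant), as a sparse dict."""
    proto = {}
    for docs, weight in ((pos_docs, beta), (neg_docs, -gamma)):
        for doc in docs:
            for term, value in doc.items():
                proto[term] = proto.get(term, 0.0) + weight * value / len(docs)
    return proto

def score(proto, doc):
    """Inner product between the prototype and a document vector."""
    return sum(proto.get(term, 0.0) * value for term, value in doc.items())

def accuracy(proto, labeled_docs):
    """Fraction of documents where (score > 0) agrees with the relevance label."""
    hits = sum((score(proto, doc) > 0) == relevant for doc, relevant in labeled_docs)
    return hits / len(labeled_docs)

def homotopy(start, end, steps):
    """Yield parameter settings on the straight line from start to end."""
    for i in range(steps + 1):
        lam = i / steps
        yield tuple((1 - lam) * a + lam * b for a, b in zip(start, end))

# Hypothetical toy data: sparse term-count vectors.
train_pos = [{"attack": 1, "plan": 1}, {"attack": 1, "target": 1}]
train_neg = [{"plan": 1, "meeting": 1}, {"meeting": 1, "lunch": 1}, {"plan": 1, "budget": 1}]
test_docs = [({"attack": 1, "meeting": 1}, True), ({"plan": 1, "lunch": 1}, False)]

# Walk from (beta=1, gamma=0) to (beta=1, gamma=1); both endpoints are illustrative.
path_results = [
    (params, accuracy(rocchio_prototype(train_pos, train_neg, *params), test_docs))
    for params in homotopy((1.0, 0.0), (1.0, 1.0), steps=4)
]
for params, acc in path_results:
    print(params, acc)
```

Evaluating every point on the path, rather than only the endpoints, is what allows an intermediate setting to be noticed if it happens to perform better.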

16
MORE SOPHISTICATED STATISTICAL APPROACHES
  • Representations: Boolean representations;
    weighting schemes
  • Matching Schemes: Boolean matching; nonlinear
    transforms of individual feature values
  • Learning Methods: new kernel-based methods
    (nonlinear classification); more complex Bayes
    classifiers to assign objects to highest
    probability class
  • Fusion Methods: combining scores based on ranks,
    linear functions, or nonparametric schemes
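Two of the fusion styles named above, rank-based and linear, can be sketched in a few lines. The Borda-style rank sum, the weights, and the toy scores below are assumptions chosen for illustration, not the schemes used in the project.

```python
def ranks(scores):
    """Map each document id to its rank (0 = best) under one system's scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {doc: r for r, doc in enumerate(ordered)}

def borda_fusion(*systems):
    """Rank-based fusion: each system gives n points to its top document, n-1 to the next, ..."""
    n = len(systems[0])
    fused = {doc: 0 for doc in systems[0]}
    for scores in systems:
        for doc, r in ranks(scores).items():
            fused[doc] += n - r
    return fused

def linear_fusion(weights, *systems):
    """Linear fusion: weighted sum of the raw scores from each system."""
    return {
        doc: sum(w * s[doc] for w, s in zip(weights, systems))
        for doc in systems[0]
    }

# Hypothetical scores from two systems that use very different scales.
system_s = {"d1": 0.9, "d2": 0.4, "d3": 0.1}
system_t = {"d1": 2.0, "d2": 30.0, "d3": 12.0}

borda = borda_fusion(system_s, system_t)
linear = linear_fusion((1.0, 0.01), system_s, system_t)
print(max(borda, key=borda.get), max(linear, key=linear.get))
```

Rank-based fusion ignores the scale of each system's scores, while linear fusion depends on the chosen weights, so the two can prefer different documents, as they do on this toy input.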

17
THE APPROACH
  • Identify the best combination of newer methods
    through careful exploration of a variety of tools.
  • Address issues of effectiveness (how well the task is
    done) and efficiency (in computational time and
    space).
  • Use a combination of new or modified algorithms and
    improved statistical methods built on the
    algorithmic primitives.
  • Systematic experimentation on components and on
    fusion schemes.

18
Mercer Kernels
  • K is positive semi-definite.
  • Mercer's Theorem gives necessary and sufficient
    conditions for a continuous symmetric function K
    to admit this representation [equation not in
    transcript]; such functions are Mercer kernels.
  • This kernel defines a set of functions H_K,
    elements of which have an expansion [equation
    not in transcript].
  • This set of functions is a reproducing kernel
    Hilbert space.
Prepared by David L. Madigan
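A small numerical check can make the positive semi-definite condition concrete. The sketch below builds the Gram matrix of a Gaussian (RBF) kernel, a standard example of a Mercer kernel, and verifies that the quadratic form c'Kc is non-negative for randomly drawn vectors c. The kernel choice, the value of gamma, and the random points are assumptions for illustration, and a spot check like this is a necessary-condition test, not a proof.

```python
import math
import random

def rbf_kernel(x, z, gamma=0.5):
    """Gaussian (RBF) kernel: exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def gram_matrix(points, kernel):
    """Matrix of kernel values K[i][j] = kernel(points[i], points[j])."""
    return [[kernel(x, z) for z in points] for x in points]

def quadratic_form(matrix, c):
    """Compute c' K c for a square matrix K and a vector c."""
    n = len(c)
    return sum(c[i] * matrix[i][j] * c[j] for i in range(n) for j in range(n))

rng = random.Random(0)
n_points = 6
points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n_points)]
K = gram_matrix(points, rbf_kernel)

is_symmetric = all(
    abs(K[i][j] - K[j][i]) < 1e-12 for i in range(n_points) for j in range(n_points)
)
# Smallest value of c' K c over 200 random vectors c; should not be negative
# beyond rounding error if K is positive semi-definite.
min_form = min(
    quadratic_form(K, [rng.gauss(0, 1) for _ in range(n_points)])
    for _ in range(200)
)
print(is_symmetric, min_form > -1e-9)
```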
19
Support Vector Machine
  • Two-class classifier with the form [equation not
    in transcript], with parameters chosen to minimize
    [equation not in transcript; its labeled parts are
    a tuning constant, a complexity penalty, and the
    Gram matrix].
  • Many of the fitted α's are usually zero; the x's
    corresponding to the non-zero α's are the support
    vectors.
Prepared by David L. Madigan
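The role of the non-zero α's can be shown on a tiny example. The sketch below assumes the common dual form f(x) = Σ α_i y_i K(x_i, x) + b, which may differ in notation from the slide's missing equation. The three 1-D training points and their coefficients are a hand-worked hard-margin solution for this toy set, not the output of a solver.

```python
def linear_kernel(x, z):
    """Plain inner product; any Mercer kernel could be substituted."""
    return sum(a * b for a, b in zip(x, z))

# Toy 1-D training set with labels +1 / -1 (hypothetical).
X = [(1.0,), (-1.0,), (3.0,)]
y = [1, -1, 1]

# Dual coefficients of the hard-margin solution for this set, worked out by hand:
# the separating function is f(x) = x, the points at +1 and -1 sit on the margin,
# and the point at 3 lies well inside its class, so its coefficient is zero.
alpha = [0.5, 0.5, 0.0]
bias = 0.0

def decision(x):
    """f(x) = sum_i alpha_i * y_i * K(x_i, x) + bias."""
    return sum(a * yi * linear_kernel(xi, x) for a, yi, xi in zip(alpha, y, X)) + bias

def predict(x):
    """Predicted class is the sign of the decision function."""
    return 1 if decision(x) >= 0 else -1

# Only training points with non-zero alpha contribute to the decision function.
support_vectors = [xi for a, xi in zip(alpha, X) if a > 0]
print(support_vectors, decision((2.0,)), predict((-0.5,)))
```

Because the third coefficient is zero, dropping that training point would leave every prediction unchanged, which is the sense in which only the support vectors matter.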
20
Regularized Linear Feature Space Model
  • Choose a model of the form [equation not in
    transcript] to minimize [equation not in
    transcript].
  • The solution is finite dimensional; the prediction
    is sign(f(x)).
  • We just need to know K, not Φ!
  • A kernel is a function K such that, for all
    x, z ∈ X, K(x, z) = ⟨Φ(x), Φ(z)⟩, where Φ is a
    mapping from X to an inner product feature space F.
Prepared by David L. Madigan
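The remark that only K is needed, not Φ, can be checked directly for a kernel whose feature map is known. The sketch below uses the degree-2 homogeneous polynomial kernel on 2-D inputs, for which Φ(x) = (x1², √2·x1·x2, x2²); this kernel and the sample points are chosen for illustration and are not taken from the slides.

```python
import math

def poly2_kernel(x, z):
    """Degree-2 homogeneous polynomial kernel: K(x, z) = (x . z)^2."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    """Explicit feature map whose inner product reproduces poly2_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def inner(u, v):
    """Ordinary inner product in the feature space."""
    return sum(a * b for a, b in zip(u, v))

x, z = (1.0, 2.0), (3.0, -1.0)
via_kernel = poly2_kernel(x, z)        # computed in the 2-D input space
via_features = inner(phi(x), phi(z))   # computed in the 3-D feature space
print(via_kernel, via_features)
```

Both routes give the same value up to rounding, but the kernel route never constructs Φ, which matters when the feature space is very high dimensional.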
21
Mixture Models
  • Pr(d | Rel) = a f(d) + (1 - a) g(d)
  • f, g may be centered at different points in
    document space. So distinct conceptual
    representations are accommodated easily.
  • Examples: multinomial distributions.
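A minimal sketch of such a mixture, assuming two unigram (multinomial) components over a four-word vocabulary, is given below. The vocabulary, the word probabilities, and the mixing weight a = 0.5 are invented for illustration; the multinomial coefficient is omitted, so each component scores the token sequence rather than the bag of counts.

```python
import math

def multinomial_likelihood(tokens, word_probs):
    """Probability of the token sequence under a unigram (multinomial) model."""
    return math.prod(word_probs[t] for t in tokens)

def mixture_likelihood(tokens, a, f_probs, g_probs):
    """Pr(d | Rel) = a * f(d) + (1 - a) * g(d)."""
    return (
        a * multinomial_likelihood(tokens, f_probs)
        + (1 - a) * multinomial_likelihood(tokens, g_probs)
    )

# Two hypothetical ways of talking about the same relevant topic:
# f favors shipping vocabulary, g favors financial vocabulary.
f_probs = {"ship": 0.6, "port": 0.3, "bank": 0.05, "wire": 0.05}
g_probs = {"ship": 0.05, "port": 0.05, "bank": 0.5, "wire": 0.4}

p_shipping = mixture_likelihood(["ship", "port"], 0.5, f_probs, g_probs)
p_finance = mixture_likelihood(["bank", "wire"], 0.5, f_probs, g_probs)
p_mixed = mixture_likelihood(["ship", "bank"], 0.5, f_probs, g_probs)
print(p_shipping, p_finance, p_mixed)
```

Documents written in either vocabulary receive a substantial likelihood under the mixture, whereas a single multinomial centered between the two would describe neither group well.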

22
Example Results on Fusion
  • http://dimacspc6.rutgers.edu/dfradkin/fusion/centroid/try.pdf
  • http://dimacspc6.rutgers.edu/dfradkin/applet/topicShowApplet.jsp
  • 60,000 documents.

23
Learning takes place in two spaces. For matching
and filtering, we learn rules in the primary
space of document features. For fusion processes
we learn rules in a secondary space of
pseudo-features, which are assigned by entire
systems to incoming documents.
[Diagram labels: Random Subspace; Feature space; Score space; Relevant]
24
REFERENCE ASPECT
  • Effective Communication with the Analyst User

25
HITIQA High-Quality Interactive Question
Answering
  • University at Albany, SUNY
  • Rutgers University

26
HITIQA Team
  • SUNY Albany
  • Prof. Tomek Strzalkowski, PI/PM
  • Prof. Rong Tang
  • Prof. Boris Yamrom, consultant
  • Ms. Sharon Small, Research Scientist
  • Mr. Ting Liu, Graduate Student
  • Mr. Nobuyuki Shimizu, Graduate Student
  • Mr. Tom Palen, summer intern
  • Mr. Peter LaMonica, summer intern/AFRL
  • Rutgers
  • Prof. Paul Kantor, co-PI
  • Prof. K.B. Ng
  • Prof. Nina Wacholder
  • Mr. Robert Rittman, Graduate Student
  • Ms. Ying Sun, Graduate Student
  • Mr. Peng Song, Graduate student

27
HITIQA Concept
28
Key Research Issues
  • Question Semantics
  • how the system understands user requests
  • Human-Computer Dialogue
  • how the user and the system negotiate this
    understanding
  • Information Quality Metrics
  • how some information is better than other
  • Information Fusion
  • how to assemble the answer that fits user needs.

29
[Architecture diagram. Components: Question Processor, Document Retrieval, DB, Segment/Filter, Cluster Segments, Build Frames, Process Frames, Dialogue Manager, Query Refinement, Gate, Answer Generator, Visualization, WordNet. Input and output: question, answer. Legend: Current Focus, Completed Work.]

30
Data-Driven NL Semantics
  • What does the question mean to the user?
  • The speech act
  • The focus
  • User's task, intention, goal
  • User's background knowledge
  • What does the question mean to the system?
  • Available information
  • Information that can be retrieved
  • The dimensions of the retrieved information

User Semantics
System Semantics
31
Answer Space Topology
ALL RETRIEVED FRAMES
KERNEL QUESTION MATCH
NEAR MISSES, ALTERNATIVE INTERPRETATIONS
32
Quality Judgments
  • Focus Group
  • Sessions conducted March-April, 2002
  • Results: nine quality aspects generated
  • Expert Sessions
  • Sessions conducted May-June, 2002
  • Results: 100 documents scored twice along 9
    quality aspects
  • Student Sessions
  • Training and testing sessions: June-July, 2002
  • 10 documents judged by experts used for
    training/testing
  • Actual judgment sessions: June-August, 2002
  • Qualified students evaluated 10 documents per
    session
  • Results: 900 documents scored twice along 9
    quality aspects

33
Factor Analysis of 9 Quality Features
34
Modeling Quality of Text
  • Kitchen sink approach
  • 160 independent variables
  • Part-of-speech, vocabulary, stylistics, named
    entities
  • Statistical pruning
  • Statistically significant variables
  • May be nonsensical to a human
  • Human pruning
  • Only sensible variables retained for each
    quality
  • Pruning improves performance
  • Kitchen sink overfits
  • Statistical and human pruning are close in performance
  • More work needed to understand the relationship

35
Performance of models
37
In Summary
  • The two conceptual foundations of librarianship,
    cataloging and reference, translate to two
    important problems in managing streams of textual
    messages.
  • Both involve pattern recognition or machine
    learning.

38
Two Roles for Learning
  • Cataloging: learning which features of a message
    mean that it is significant to the problem at
    hand
  • Reference: learning which features of a message
    mean that it is salient to a specific user of
    the system.

39
Appendix
The following slides were not presented
at the conference.
40
Communicating Credibility
  • A system that is correct 75% or 80% of the time
    will be wrong one time in every four or five.
  • Unless it can shade its judgments or
    recommendations, the analyst will lose confidence
    in it.
  • Credibility must be high enough to avoid
    extensive rework.

41
Data Fusion
  • Use multiple methods to assess the relevance of
    documents or passages,
  • For a given question, dialogue, or cluster
  • Each method assigns a score
  • Candidates become points in a score space
  • Seek patterns to localize the most relevant
    documents or passages in this score space
  • Developed interactive data analysis tool

42
Background on Fusion Problem
  • There are systems S, T, U, ...
  • There are problems to be solved P, Q, R, ...
  • This defines several fusion problems
  • Local fusion: for a given problem P and a pair
    of systems S, T, what is the best fusion rule?
  • Let s(d), t(d) be the scores assigned to
    document d by systems S and T. Fusion tries to
    find the best combining function f(s,t)

43
Non-linear iso-relevance
44
Local Fusion Rule
  • A local fusion rule f_P(s,t) depends on the
    specific problem P.
  • This is relevant if P represents a static problem
    or profile, which will be considered on many
    occasions
  • A global fusion rule f(s,t) does not depend on a
    specific problem P, and can be safely used on a
    variety of problems.

45
Local Fusion Results are Good
  • Completely rigorous. For each topic:
  • 1) Randomly split the documents into two parts:
    training and testing
  • 2) Do the logistic regression on the training part
    and get the fusion scores for both training and
    testing documents
  • 3) Calculate p_100 on the testing documents.
  • 4) Excellent results (one random sample for each)
  • 5) Test SMART and InQuery on the same random
    testing set
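The procedure above can be sketched end to end on synthetic data. Everything specific here is an assumption: the two score columns stand in for SMART and InQuery, relevance is simulated as depending on the sum of the two scores, the logistic regression is a plain batch gradient descent written out by hand, and p_100 is read as precision at rank 100 (the toy run uses rank 10 because the test half has only 50 documents).

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(rows, labels, lr=1.0, epochs=2000):
    """Fit [bias, w_s, w_t] by batch gradient descent on the log loss."""
    w = [0.0, 0.0, 0.0]
    n = len(rows)
    for _ in range(epochs):
        grad = [0.0, 0.0, 0.0]
        for (s, t), y in zip(rows, labels):
            err = sigmoid(w[0] + w[1] * s + w[2] * t) - y
            grad[0] += err
            grad[1] += err * s
            grad[2] += err * t
        w = [wi - lr * gi / n for wi, gi in zip(w, grad)]
    return w

def fused_score(w, s, t):
    """Fusion rule f_P(s, t) learned for this one topic."""
    return sigmoid(w[0] + w[1] * s + w[2] * t)

def precision_at_k(scores, labels, k):
    """Fraction of relevant documents among the k highest-scored ones."""
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)[:k]
    return sum(y for _, y in ranked) / k

# Synthetic topic: scores on a 10 x 10 grid, relevant when the two scores sum to >= 1.
docs = [((i / 10, j / 10), 1 if i + j >= 10 else 0) for i in range(10) for j in range(10)]
rng = random.Random(0)
rng.shuffle(docs)                      # step 1: random split
train, test = docs[:50], docs[50:]

weights = fit_logistic([x for x, _ in train], [y for _, y in train])  # step 2
test_labels = [y for _, y in test]
p_fused = precision_at_k([fused_score(weights, s, t) for (s, t), _ in test], test_labels, 10)
p_s = precision_at_k([s for (s, _), _ in test], test_labels, 10)   # step 5: single systems
p_t = precision_at_k([t for (_, t), _ in test], test_labels, 10)
print(p_fused, p_s, p_t)
```

Fitting only on the training half and scoring only the held-out half is what keeps the comparison with the single systems fair.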

46
Summary of Local Fusion
PROBLEM CASE: We ran 5 split-half runs on the odd
case (318) and the results persist.
47
Is Local Sensible?
  • Local fusion depends on getting information about
    a particular topic, and doing the best possible
    fusion.
  • Not available in an ad hoc (e.g. Google) setting
  • Potentially available in intelligence
    applications: filtering against a standing profile

50
Our Approach to Retrieval Fusion
Adaptive Local Fusion
[Diagram labels: Request; SMART and InQuery, each with a Result Set; Documents; Sets; Fusion; Process; Monitor Fusion Set and Receive Feedback; Delivered Set; Use Better System; Adopt Fusion System]