Title: Novelty Detection and Profile Tracking from Massive Data
1 Novelty Detection and Profile Tracking from Massive Data
Jaime Carbonell Eugene Fink
Santosh Ananthraman
3 Motivation
Search for interesting patterns in large data sets
- Current applications
- Processing of intelligence data
- Prediction of natural threats
- Future applications
- Scientific discoveries
- Analysis of business data
- and more
4 Outline
- Main results of the ARGUS project
  - Approximate matching
  - Streaming data
  - Novelty detection
- More about approximate matching
  - Records and queries
  - Search for matches
  - Experimental results
5 Large data sets
Large: From a million (10^6) to several billion (10^10) records
Data: Structured records with numbers, strings, and nominal values
Sets: Databases and streams of records
- Specific sets
- Hospital admissions (1.7 million records)
- Network flow (5 trillion records)
- Federal wire (simulated data)
6 Main results
We have developed a system that addresses three problems:
- Retrieval of approximate matches for known
patterns
- Processing of streaming data
- Identification of new patterns and gradual
changes in old patterns
7 Approximate matching
Fast identification of approximate matches in large sets of records
- Examples
- Misspelled names
- Inexact numbers
- Spatial proximity
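Each of the three examples can be read as a distance between a stored value and a query value. The sketch below is only an illustration under assumed measures: Levenshtein edit distance for names, absolute difference for numbers, and Euclidean distance for locations. The function names are hypothetical, and the slides do not say which measures ARGUS actually uses.

```python
import math


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance; an assumed measure for misspelled names."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def numeric_distance(x: float, y: float) -> float:
    """Absolute difference; an assumed measure for inexact numbers."""
    return abs(x - y)


def spatial_distance(p: tuple, q: tuple) -> float:
    """Euclidean distance in the plane; an assumed measure for proximity."""
    return math.hypot(p[0] - q[0], p[1] - q[1])
```

A record value would then count as an approximate match when its distance to the query value falls under a chosen threshold.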
8 Streaming data
Continuous search for matches in a stream of new records
- Maintain a set of pending queries
- Identify matches for these queries among
incoming records
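The two steps above can be sketched as a loop that tests every incoming record against every pending query. This is a naive baseline under assumed representations (records as dictionaries, queries as predicate functions); the names `match_stream` and `pending` are hypothetical and not taken from the ARGUS system.

```python
from typing import Callable, Dict, Iterable, Iterator, Tuple

Record = Dict[str, object]
Query = Callable[[Record], bool]


def match_stream(records: Iterable[Record],
                 pending: Dict[str, Query]) -> Iterator[Tuple[str, Record]]:
    """Yield (query name, record) for each pending query a record satisfies."""
    for record in records:
        for name, query in pending.items():
            if query(record):
                yield name, record


# Hypothetical pending query and a tiny stream of records.
pending = {"young_asthma": lambda r: r["dx"] == "asthma" and r["age"] < 40}
stream = [{"age": 30, "dx": "asthma"}, {"age": 50, "dx": "flu"}]
hits = list(match_stream(stream, pending))
```

The cost of this baseline grows with the number of pending queries times the number of records, which is the cost that sharing work across queries aims to reduce.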
9 RETE network
Identify common parts of queries and arrange them into a RETE network, which significantly reduces the matching time
- Hundreds to thousands of pending queries
- Tens to hundreds of records per second
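The benefit of identifying common parts can be illustrated with a much simplified sketch: each distinct condition is evaluated once per record, and the result is shared by every query that uses it. This is not a full RETE network (it has no join nodes or partial-match memories), and the class and method names are hypothetical.

```python
from typing import Callable, Dict, FrozenSet, Iterable, List

Record = Dict[str, object]


class SharedConditionMatcher:
    """Evaluates each shared condition once per record (simplified, not full RETE)."""

    def __init__(self) -> None:
        self.conditions: Dict[str, Callable[[Record], bool]] = {}
        self.queries: Dict[str, FrozenSet[str]] = {}
        self.evaluations = 0  # counts condition tests, to show the saving

    def add_condition(self, name: str, test: Callable[[Record], bool]) -> None:
        self.conditions[name] = test

    def add_query(self, name: str, condition_names: Iterable[str]) -> None:
        needed = frozenset(condition_names)
        missing = needed - self.conditions.keys()
        if missing:
            raise KeyError(f"unknown conditions: {sorted(missing)}")
        self.queries[name] = needed

    def match(self, record: Record) -> List[str]:
        results = {}
        for cname, test in self.conditions.items():
            results[cname] = test(record)
            self.evaluations += 1
        return [q for q, needed in self.queries.items()
                if all(results[c] for c in needed)]


matcher = SharedConditionMatcher()
matcher.add_condition("adult", lambda r: r["age"] >= 18)
matcher.add_condition("asthma", lambda r: r["dx"] == "asthma")
matcher.add_condition("male", lambda r: r["sex"] == "male")
matcher.add_query("q1", ["adult", "asthma"])
matcher.add_query("q2", ["adult", "male"])
matched = matcher.match({"sex": "female", "age": 30, "dx": "asthma"})
```

Here the two queries contain four conditions in total, but only three distinct tests run per record because "adult" is shared.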
10 Novelty detection
11 Example: Static event
12 Example: New event
[Figure: axes labeled density and distance]
13 Example: Hidden event
[Figure: axes labeled density and distance]
14 Example: Growing event
[Figure: axes labeled density and distance]
15 Visualization
- Display of records, clusters, and queries in two and three dimensions
- Access to data tables and analysis results
16 Example: Data and clusters
17 Example: Density analysis
18 Information flow
19 Outline
- Main results of the ARGUS project
  - Approximate matching
  - Streaming data
  - Novelty detection
- More about approximate matching
  - Records and queries
  - Search for matches
  - Experimental results
20 Motivation
Retrieval of relevant records based on partially inaccurate information
- Inaccurate records
- Inaccurate queries
- Incomplete knowledge
21 Table of records
We specify a table of records by a list of attributes
Example: We can describe patients in a hospital by their sex, age, and diagnosis
22 Records and queries
A record includes a specific value for each attribute
A query may include lists of values and numeric ranges
Query: Sex: male, female; Age: 20..40; Dx: asthma, flu
23 Query types
24 Exact matches
A record is an exact match for a query if every value in the record belongs to the respective range in the query
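The definition can be sketched directly, assuming a query maps each attribute either to a set of allowed values or to a numeric range given as a (low, high) pair. The inclusive range endpoints, the dictionary representation, and the function names are assumptions made for illustration.

```python
from typing import Dict, Set, Tuple, Union

Constraint = Union[Set[str], Tuple[float, float]]
Record = Dict[str, object]
Query = Dict[str, Constraint]


def value_matches(value, constraint: Constraint) -> bool:
    """True if the value lies in the range or belongs to the value list."""
    if isinstance(constraint, tuple):
        low, high = constraint
        return low <= value <= high  # endpoints assumed inclusive
    return value in constraint


def is_exact_match(record: Record, query: Query) -> bool:
    """Every constrained attribute of the record must satisfy the query."""
    return all(value_matches(record[attr], c) for attr, c in query.items())


# The example query from the slides: Sex male, female; Age 20..40; Dx asthma, flu.
query = {"sex": {"male", "female"}, "age": (20, 40), "dx": {"asthma", "flu"}}
```

With this query, a 30-year-old female patient with asthma is an exact match, while a 50-year-old male patient with flu is not, because the age falls outside 20..40.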
25 Approximate matches
A record is an approximate match for a query if it is close to the query region
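One way to make "close to the query region" concrete is to measure how far each attribute value lies outside its allowed range and add the results. The sketch below assumes a per-attribute distance (gap to the nearest range endpoint for numbers, 0 or 1 for nominal values), an unweighted sum, and a user-chosen threshold; the slides do not specify the actual distance function, so all of these are illustrative assumptions.

```python
from typing import Dict, Set, Tuple, Union

Constraint = Union[Set[str], Tuple[float, float]]


def attribute_distance(value, constraint: Constraint) -> float:
    """Distance from one value to its allowed range or value list."""
    if isinstance(constraint, tuple):
        low, high = constraint
        if value < low:
            return float(low - value)
        if value > high:
            return float(value - high)
        return 0.0
    return 0.0 if value in constraint else 1.0  # assumed nominal penalty


def distance_to_query(record: Dict[str, object],
                      query: Dict[str, Constraint]) -> float:
    """Assumed overall distance: unweighted sum over constrained attributes."""
    return sum(attribute_distance(record[a], c) for a, c in query.items())


def is_approximate_match(record, query, threshold: float) -> bool:
    return distance_to_query(record, query) <= threshold


query = {"sex": {"male", "female"}, "age": (20, 40), "dx": {"asthma", "flu"}}
record = {"sex": "male", "age": 43, "dx": "flu"}
```

Under these assumptions the record above lies at distance 3 from the query region, since only the age is outside its range, so it is an approximate match for any threshold of 3 or more. An exact match is the special case of distance 0.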
26 Approximate queries
An approximate query includes
27 Indexing tree
- Maintain a PATRICIA tree of records
[Figure: indexing tree with branches labeled by sex (male, female), age (30, 40, 50), and diagnosis (ulcer, asthma, fracture, flu)]
28 Search for matches
- Depth-first search for exact matches
- Best-first search for approximate matches
[Figure: indexing tree with branches labeled by sex (male, female), age (30, 40, 50), and diagnosis (ulcer, asthma, fracture, flu)]
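The two search strategies can be sketched on a simplified index: a nested dictionary with one level per attribute, standing in for the PATRICIA tree (no path compression). Depth-first search follows only branches that satisfy the query; best-first search expands branches in order of accumulated distance, so records come out closest first. The sample records are hypothetical, loosely based on the values in the figure, and the distance function is an assumption.

```python
import heapq
from itertools import count

ATTRS = ("sex", "age", "dx")  # one tree level per attribute


def build_tree(records):
    """Nested dicts keyed by attribute values; leaves are lists of records."""
    root = {}
    for rec in records:
        node = root
        for attr in ATTRS[:-1]:
            node = node.setdefault(rec[attr], {})
        node.setdefault(rec[ATTRS[-1]], []).append(rec)
    return root


def dist(value, constraint):
    """Assumed distance: gap to a numeric range, or 0/1 for a value list."""
    if isinstance(constraint, tuple):
        low, high = constraint
        return max(low - value, value - high, 0)
    return 0 if value in constraint else 1


def exact_search(node, query, depth=0):
    """Depth-first search that prunes branches violating the query."""
    if depth == len(ATTRS):
        yield from node
        return
    for key, child in node.items():
        if dist(key, query[ATTRS[depth]]) == 0:
            yield from exact_search(child, query, depth + 1)


def best_first_search(root, query, limit):
    """Return up to `limit` (distance, record) pairs, closest first."""
    tie = count()  # tie-breaker so the heap never compares dicts
    heap = [(0, next(tie), 0, root)]
    found = []
    while heap and len(found) < limit:
        d, _, depth, node = heapq.heappop(heap)
        if depth == len(ATTRS):
            found.extend((d, rec) for rec in node)
            continue
        for key, child in node.items():
            step = dist(key, query[ATTRS[depth]])
            heapq.heappush(heap, (d + step, next(tie), depth + 1, child))
    return found[:limit]


records = [  # hypothetical sample data
    {"sex": "male", "age": 30, "dx": "ulcer"},
    {"sex": "male", "age": 30, "dx": "asthma"},
    {"sex": "male", "age": 50, "dx": "fracture"},
    {"sex": "female", "age": 40, "dx": "asthma"},
    {"sex": "female", "age": 40, "dx": "flu"},
    {"sex": "female", "age": 30, "dx": "flu"},
]
tree = build_tree(records)
query = {"sex": {"male", "female"}, "age": (20, 40), "dx": {"asthma", "flu"}}
exact = list(exact_search(tree, query))
approx = best_first_search(tree, query, limit=5)
```

Because the accumulated distance never decreases along a path, the best-first search can stop as soon as it has collected enough records, without visiting distant branches such as the 50-year-old patient.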
29 Performance
Experiments with a database of all patients admitted to Massachusetts hospitals from October 2000 to September 2002
- Twenty-one attributes
- 1.7 million records
- Use of a Pentium computer
- 2.4 GHz CPU
- 1 Gbyte memory
- 400 MHz bus
30 Variables
- Control variables
- Number of records
- Memory size
- Query type
- Measurements
- Retrieval time
31 Small memory
- Number of records: 100 to 1,670,000
- Memory size: 4 MByte
32 Large memory
- Number of records: 1,670,000
- Memory size: 64 to 1,024 MByte
33 Scalability
Retrieval time grows as a fractional power (about 0.5) of the database size

Number of records (n)    n^0.5      Time (seconds)
1,000,000                1,000      0.05
100,000,000              10,000     0.50
10,000,000,000           100,000    5.00
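The stated growth rate can be checked with a short calculation. The sketch below assumes retrieval time follows a pure power law, time = c * n^0.5, anchored at the slide's first data point (0.05 seconds for 1,000,000 records); the function name and the choice of anchor are illustrative assumptions.

```python
def predicted_time(n_records: float,
                   ref_records: float = 1_000_000,
                   ref_seconds: float = 0.05,
                   exponent: float = 0.5) -> float:
    """Scale a reference retrieval time by (n / n_ref) ** exponent."""
    return ref_seconds * (n_records / ref_records) ** exponent


for n in (1_000_000, 100_000_000, 10_000_000_000):
    print(f"{n:>14,} records: {predicted_time(n):.2f} s")
```

With an exponent of 0.5, every hundredfold increase in the number of records multiplies the predicted time by ten, which agrees with the three retrieval times listed on the slide.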
34 Distributed architecture
Indexing trees on multiple computers
35 Conclusions
- We have developed a set of tools for analysis
of massive structured data
- Experiments have shown that these tools improve the
productivity of intelligence analysts
- Future work includes development of more
tools and application to other domains