Novelty Detection and Profile Tracking from Massive Data

Slides: 36
Provided by: euge52
Learn more at: http://www.cs.cmu.edu

Transcript and Presenter's Notes

Title: Novelty Detection and Profile Tracking from Massive Data


1
Novelty Detection and Profile Tracking from Massive Data
Jaime Carbonell, Eugene Fink, Santosh Ananthraman
2
Motivation
Search for interesting patterns in large data sets
3
Motivation
Search for interesting patterns in large data sets
  • Current applications
    - Processing of intelligence data
    - Prediction of natural threats
  • Future applications
    - Scientific discoveries
    - Analysis of business data
    - and more

4
Outline
  • Main results of the ARGUS project
    - Approximate matching
    - Streaming data
    - Novelty detection
  • More about approximate matching
    - Records and queries
    - Search for matches
    - Experimental results

5
Large data sets
Large: From a million (10^6) to several billion (10^10) records
Data: Structured records with numbers, strings, and nominal values
Sets: Databases and streams of records
  • Specific sets
    - Hospital admissions (1.7 million records)
    - Network flow (5 trillion records)
    - Federal wire (simulated data)

6
Main results
We have developed a system that addresses three problems:
  • Retrieval of approximate matches for known
    patterns
  • Processing of streaming data
  • Identification of new patterns and gradual
    changes in old patterns

7
Approximate matching
Fast identification of approximate matches in large sets of records
  • Examples
    - Misspelled names
    - Inexact numbers
    - Spatial proximity
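
Each of the three examples above corresponds to a different notion of distance. The sketch below is illustrative only: the slides do not say which distance functions the system uses, so edit distance, absolute difference, and Euclidean distance are assumptions, and all function names are hypothetical.

```python
import math

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

def numeric_distance(x: float, y: float) -> float:
    """Inexact numbers: absolute difference (an assumed choice)."""
    return abs(x - y)

def spatial_distance(p, q) -> float:
    """Spatial proximity: Euclidean distance between two points."""
    return math.dist(p, q)
```

A misspelled name such as "Carbonel" is then one edit away from "Carbonell", so a query with a small distance limit would still retrieve it.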

8
Streaming data
Continuous search for matches in a stream of new records
  • Maintain a set of pending queries
  • Identify matches for these queries among
    incoming records
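
The two steps above can be sketched as a loop that checks every incoming record against every pending query. This is a naive baseline under assumed representations (records as dictionaries, queries as predicates); the names are hypothetical and it is not the system's actual implementation.

```python
from typing import Callable, Dict, Iterable, Iterator, Tuple

Record = Dict[str, object]
Query = Callable[[Record], bool]

def match_stream(records: Iterable[Record],
                 pending: Dict[str, Query]) -> Iterator[Tuple[str, Record]]:
    """Yield (query name, record) for every pending query the record satisfies."""
    for record in records:
        for name, query in pending.items():
            if query(record):
                yield name, record

# Hypothetical pending query and a short stream of records.
pending = {"young_asthma": lambda r: r["dx"] == "asthma" and r["age"] < 30}
stream = [
    {"sex": "male", "age": 25, "dx": "asthma"},
    {"sex": "female", "age": 45, "dx": "asthma"},
    {"sex": "female", "age": 19, "dx": "flu"},
]
hits = list(match_stream(stream, pending))
```

The cost of this baseline grows with the number of queries times the number of records, which is the overhead that sharing work across queries is meant to reduce.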

9
RETE network
Identify common parts of queries and arrange them into a RETE network, which significantly reduces the matching time
  • Hundreds to thousands of pending queries
  • Tens to hundreds of records per second
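
The benefit of identifying common parts of queries can be illustrated with a much simpler mechanism than a full RETE network: evaluate each shared condition at most once per record and reuse the result. The sketch below shows only that sharing idea. It has no alpha or beta nodes and no incremental state, and the condition and query names are hypothetical.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

Record = Dict[str, object]

# Hypothetical named conditions; several queries may refer to the same one.
CONDITIONS: Dict[str, Callable[[Record], bool]] = {
    "is_male": lambda r: r["sex"] == "male",
    "age_20_40": lambda r: 20 <= r["age"] <= 40,
    "has_flu": lambda r: r["dx"] == "flu",
}

# Each query is a conjunction of named conditions; "age_20_40" is shared.
QUERIES: Dict[str, FrozenSet[str]] = {
    "q1": frozenset({"is_male", "age_20_40"}),
    "q2": frozenset({"age_20_40", "has_flu"}),
}

def match_with_sharing(record: Record,
                       conditions=CONDITIONS,
                       queries=QUERIES) -> Tuple[List[str], int]:
    """Return the matching query names and the number of conditions evaluated."""
    results: Dict[str, bool] = {}

    def holds(name: str) -> bool:
        # Evaluate a condition only the first time any query asks for it.
        if name not in results:
            results[name] = conditions[name](record)
        return results[name]

    matched = [q for q, conds in queries.items()
               if all(holds(c) for c in conds)]
    return matched, len(results)

matched, evaluated = match_with_sharing({"sex": "male", "age": 30, "dx": "flu"})
```

For this record both queries match after three condition evaluations, where checking the queries independently would take four.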

10
Novelty detection
11
Example: Static event
12
Example: New event
[Figure; axis labels: density, distance]
13
Example: Hidden event
[Figure; axis labels: density, distance]
14
Example: Growing event
[Figure; axis labels: density, distance]
15
Visualization
  • Display of records, clusters, and queries in
    two and three dimensions
  • Access to data tables and analysis results

16
Example: Data and clusters
17
Example: Density analysis
18
Information flow
19
Outline
  • Main results of the ARGUS project
    - Approximate matching
    - Streaming data
    - Novelty detection
  • More about approximate matching
    - Records and queries
    - Search for matches
    - Experimental results

20
Motivation
Retrieval of relevant records based on partially inaccurate information
  • Inaccurate records
  • Inaccurate queries
  • Incomplete knowledge

21
Table of records
We specify a table of records by a list of attributes.
Example: We can describe patients in a hospital by their sex, age, and diagnosis.
22
Records and queries
A record includes a specific value for each attribute.
A query may include lists of values and numeric ranges.
Query: Sex: male, female; Age: 20..40; Dx: asthma, flu
23
Query types
24
Exact matches
A record is an exact match for a query if every value in the record belongs to the respective range in the query
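
The definition above translates directly into a per-attribute test. In this minimal sketch the representation is an assumption: a query maps each attribute either to a set of allowed values or to an inclusive (low, high) numeric range, and the function name is hypothetical.

```python
from typing import Dict, Set, Tuple, Union

Constraint = Union[Set[str], Tuple[float, float]]

def is_exact_match(record: Dict[str, object],
                   query: Dict[str, Constraint]) -> bool:
    """True if every queried attribute value lies in its list or range."""
    for attr, constraint in query.items():
        value = record[attr]
        if isinstance(constraint, tuple):
            low, high = constraint          # inclusive numeric range
            if not (low <= value <= high):
                return False
        elif value not in constraint:       # list of allowed values
            return False
    return True

# The query from the "Records and queries" slide.
query = {"sex": {"male", "female"}, "age": (20, 40), "dx": {"asthma", "flu"}}
```

With this query, a 30-year-old male patient with asthma is an exact match, while a 50-year-old female patient with flu is not, because the age falls outside 20..40.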
25
Approximate matches
A record is an approximate match for a query if it is close to the query region
[Figure label: Record]
26
Approximate queries
An approximate query includes
  • Point or region
  • Distance function
  • Number of matches
  • Distance limit
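
The four components listed above can be gathered into one structure. The following is a sketch under stated assumptions: the region uses sets and inclusive ranges, the distance is a simple sum over attributes (zero inside a range, the gap to the nearest bound outside it, and 1 for a nominal mismatch), and the search is a linear scan. None of these choices comes from the slides, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]

def region_distance(record: Record, region: dict) -> float:
    """Assumed distance from a record to a query region (sum over attributes)."""
    total = 0.0
    for attr, constraint in region.items():
        value = record[attr]
        if isinstance(constraint, tuple):
            low, high = constraint
            total += max(low - value, 0, value - high)  # 0 inside the range
        elif value not in constraint:
            total += 1.0                                # nominal mismatch
    return total

@dataclass
class ApproximateQuery:
    region: dict                                  # point or region
    distance: Callable[[Record, dict], float]     # distance function
    max_matches: int                              # number of matches
    distance_limit: float                         # distance limit

    def run(self, records: List[Record]) -> List[Tuple[float, Record]]:
        """Return up to max_matches records within the limit, closest first."""
        scored = [(self.distance(r, self.region), r) for r in records]
        within = [pair for pair in scored if pair[0] <= self.distance_limit]
        within.sort(key=lambda pair: pair[0])
        return within[: self.max_matches]

records = [
    {"sex": "male", "age": 30, "dx": "asthma"},
    {"sex": "female", "age": 43, "dx": "flu"},
    {"sex": "male", "age": 70, "dx": "ulcer"},
]
q = ApproximateQuery(region={"age": (20, 40), "dx": {"asthma", "flu"}},
                     distance=region_distance, max_matches=2, distance_limit=5)
result = q.run(records)
```

Mixing years and nominal mismatches in one sum is a simplification; a practical distance function would weight or normalize each attribute.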

27
Indexing tree
  • Maintain a PATRICIA tree of records

[Figure: indexing tree; node labels: male, female; 30, 50, 40, 30; ulcer, asthma, fracture, asthma, flu, flu]
28
Search for matches
  • Depth-first search for exact matches
  • Best-first search for approximate matches

[Figure: indexing tree; node labels: male, female; 30, 50, 40, 30; ulcer, asthma, fracture, asthma, flu, flu]
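
The two search strategies can be sketched on a small tree that branches on sex, then age, then diagnosis. This is an illustration under assumptions: it uses a plain nested-dictionary trie instead of a compressed PATRICIA tree, the per-attribute distance is invented for the example, and the records only reuse values visible in the figure labels; their arrangement is hypothetical.

```python
import heapq
from itertools import count

ATTRS = ("sex", "age", "dx")  # assumed branching order

def build_tree(records):
    """Nested dicts keyed by attribute value; the last level holds record lists."""
    root = {}
    for rec in records:
        node = root
        for attr in ATTRS[:-1]:
            node = node.setdefault(rec[attr], {})
        node.setdefault(rec[ATTRS[-1]], []).append(rec)
    return root

def value_distance(value, constraint):
    """Assumed distance: gap to a range, or 0/1 for membership in a value set."""
    if isinstance(constraint, tuple):
        low, high = constraint
        return float(max(low - value, 0, value - high))
    return 0.0 if value in constraint else 1.0

def exact_matches(node, query, depth=0):
    """Depth-first search: descend only into branches that satisfy the query."""
    if depth == len(ATTRS):
        return list(node)
    found = []
    for value, child in node.items():
        if value_distance(value, query[ATTRS[depth]]) == 0:
            found.extend(exact_matches(child, query, depth + 1))
    return found

def best_matches(root, query, k):
    """Best-first search: always expand the path with the smallest distance so far."""
    tie = count()  # tie-breaker so the heap never compares tree nodes
    heap = [(0.0, next(tie), 0, root)]
    results = []
    while heap and len(results) < k:
        dist, _, depth, node = heapq.heappop(heap)
        if depth == len(ATTRS):
            results.extend((dist, rec) for rec in node)
            continue
        for value, child in node.items():
            step = value_distance(value, query[ATTRS[depth]])
            heapq.heappush(heap, (dist + step, next(tie), depth + 1, child))
    return results[:k]

records = [
    {"sex": "male", "age": 30, "dx": "ulcer"},
    {"sex": "male", "age": 30, "dx": "asthma"},
    {"sex": "male", "age": 50, "dx": "fracture"},
    {"sex": "female", "age": 40, "dx": "asthma"},
    {"sex": "female", "age": 40, "dx": "flu"},
    {"sex": "female", "age": 30, "dx": "flu"},
]
tree = build_tree(records)
query = {"sex": {"male", "female"}, "age": (20, 40), "dx": {"asthma", "flu"}}
exact = exact_matches(tree, query)
best = best_matches(tree, query, k=5)
```

Because every step adds a non-negative distance, the accumulated value on a partial path is a lower bound for all records below it, so best-first search returns records in order of increasing distance and can stop after the requested number of matches.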
29
Performance
Experiments with a database of all patients admitted to Massachusetts hospitals from October 2000 to September 2002
  • Twenty-one attributes
  • 1.7 million records
  • Use of a Pentium computer
    - 2.4 GHz CPU
    - 1 Gbyte memory
    - 400 MHz bus

30
Variables
  • Control variables
    - Number of records
    - Memory size
    - Query type
  • Measurements
    - Retrieval time

31
Small memory
  • Number of records: 100 to 1,670,000
  • Memory size: 4 MByte

32
Large memory
  • Number of records: 1,670,000
  • Memory size: 64 to 1,024 MByte

33
Scalability
Retrieval time grows as a fractional power (about 0.5) of database size: time is proportional to n^0.5

Number of records (n)    Time (seconds)
1,000,000                0.05
100,000,000              0.50
10,000,000,000           5.00
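
The scaling claim can be checked with a short calculation. The sketch below assumes the model time = c * n^0.5, fits the constant c from the first figure on the slide (0.05 seconds at one million records), and extrapolates; the function names are hypothetical.

```python
import math

def fit_constant(n: float, seconds: float, exponent: float = 0.5) -> float:
    """Solve seconds = c * n**exponent for c using one measurement."""
    return seconds / n ** exponent

def predict_seconds(n: float, c: float, exponent: float = 0.5) -> float:
    """Predicted retrieval time for n records under the power-law model."""
    return c * n ** exponent

# Calibrate on the slide's first figure: 0.05 seconds at 1,000,000 records.
c = fit_constant(1_000_000, 0.05)

for n in (1_000_000, 100_000_000, 10_000_000_000):
    print(f"{n:>14,d} records -> {predict_seconds(n, c):.2f} s")
```

Under this model a hundredfold increase in the number of records raises retrieval time only tenfold, which agrees with the three figures on the slide.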
34
Distributed architecture
Indexing trees on multiple computers
35
Conclusions
  • We have developed a set of tools for analysis
    of massive structured data
  • Experiments have shown that these tools improve the
    productivity of intelligence analysts
  • Future work includes development of more
    tools and application to other domains