Anomaly Detection for Scientific Data

About This Presentation

Slides: 10

Provided by: markschw

Transcript and Presenter's Notes

1
Anomaly Detection for Scientific Data

2
What is Anomaly Detection?

Seek to find parts of the data (anomalies) that
are different from the rest of the data
Supervised approaches use examples of
anomalies; unsupervised approaches do not.

3
How can Anomaly Detection be Applied to
Scientific Data?

4
Example Earth Science Application: Vegetation Data

Joint work with Ranga Myneni of Boston University
Used Leaf Area Index (LAI) and Fraction of Absorbed
Photosynthetically Active Radiation (FPAR)
from the Moderate Resolution Imaging
Spectroradiometer (MODIS) instruments aboard the
Terra and Aqua satellites

5
Results

Used MODIS data from one time point at 4 km
resolution (7.7 million pixels within Earth's
land area)
Used 4 variables: LAI, FPAR, QA, and latitude
Used an unsupervised, distance-based anomaly
detection algorithm
The #1 outlier was in northern Russia and the #2
outlier was in southern New Zealand
Both points had unusually high LAI and FPAR
values for their latitudes
Investigation revealed a bug in the software that
produced the LAI and FPAR products
Error was corrected, and new versions of the data
were made available to the scientific community.

6
Algorithm Used: Orca (Distance-Based Outliers)

Joint work with Stephen Bay of ISLE

7
Orca Algorithm

Based on nested loops
For each example, find its nearest neighbors
with a sequential scan
Modified with a pruning rule
While performing the sequential scan,
Keep track of closest neighbors found so far
Prune examples once the neighbors found so far
indicate that the example cannot be a top outlier
Worst case: O(N^2) distance computations
In practice, runs in nearly linear time
Can handle millions of data points
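
The nested loop and pruning rule above can be sketched in Python. This is a simplified illustration, not the Orca implementation: it assumes the outlier score is the average Euclidean distance to the k nearest neighbors, scans candidates in a random order, and handles one example at a time. The function name top_outliers and its parameters k, n_outliers, and seed are hypothetical names chosen for this sketch.

```python
import heapq
import math
import random


def top_outliers(points, k=5, n_outliers=2, seed=0):
    """Return (score, index) pairs for the n_outliers highest-scoring points.

    Score (an assumption of this sketch) = average distance to the k nearest
    neighbors. Results are sorted with the strongest outlier first.
    """
    # Scan candidate neighbors in random order so that close neighbors tend
    # to be found early, which is what makes the pruning rule effective.
    order = list(range(len(points)))
    random.Random(seed).shuffle(order)

    top = []      # min-heap of (score, index): best outliers found so far
    cutoff = 0.0  # score of the weakest outlier in `top` once it is full

    for i in range(len(points)):
        nearest = []  # max-heap (negated distances) of the k closest so far
        pruned = False
        for j in order:
            if j == i:
                continue
            d = math.dist(points[i], points[j])
            if len(nearest) < k:
                heapq.heappush(nearest, -d)
            elif d < -nearest[0]:
                heapq.heapreplace(nearest, -d)
            # Pruning rule: once k neighbors are held, the running score can
            # only shrink as the scan continues. If it is already below the
            # cutoff, this example cannot be a top outlier, so stop early.
            if len(nearest) == k and -sum(nearest) / k < cutoff:
                pruned = True
                break
        if pruned or not nearest:
            continue
        score = -sum(nearest) / len(nearest)
        if len(top) < n_outliers:
            heapq.heappush(top, (score, i))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, i))
        if len(top) == n_outliers:
            cutoff = top[0][0]

    return sorted(top, reverse=True)
```

Most examples in a dense region are pruned after only a few distance computations, which is the intuition behind the near-linear running time observed in practice. The worst case remains quadratic, since an unlucky scan order or a data set with no clear outliers may force a full scan for every example.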

8
Conclusions

Anomaly detection algorithms can find
previously-unknown anomalies in large scientific
data sets
Could lead to scientific discoveries or
correction of errors
Different algorithms find qualitatively different
anomalies, so it is worth running multiple
algorithms
I presented one algorithm (Orca) that runs in
nearly linear time, so it can be applied to very
large data sets

9
Pruning