Title: A Research Sampler
1. A Research Sampler
- shasha_at_cs.nyu.edu
- http://cs.nyu.edu/cs/faculty/shasha/index.html
2. Philosophy
- Research should be fun -- good puzzles, interesting algorithms.
- Research should be useful -- work with real users whenever possible.
- Implementation should be fast (I use a very powerful programming environment that I expect my students to learn).
3. Thesis Philosophy
- Ideal thesis should have an interesting algorithm with analysis, an implementation, and users.
- Of the 15 theses I have supervised, 13 follow this model. The other two were pure systems theses.
4. Current Research Topics
- Time series analysis: finding correlation/bursts. Query by humming.
- AQuery: Database for ordered data (like time series)
- Computational biology: data analysis, visualization, proteomics
5. Online Pattern Discovery
- Sensor-less: Pairs trading in stock trading (find highly correlated pairs in n log n time)
- Sensor-full: Gamma Ray Detection in astrophysics (burst detection over a large number of window sizes in almost linear time)
- Dennis Shasha (joint work with Yunyue Zhu, Xiaojian Zhao, Zhihua Wang, Tyler Neylon, Xin Zhang and Prof. Richard Cole)
6. Goal of this work
- Time series are important in so many applications: biology, medicine, finance, music, physics.
- A few fundamental operations occur all the time: burst detection, correlation, pattern matching.
- Extend functionality for music and science.
7. StatStream (VLDB, 2002): Example
- Stock price streams
- The New York Stock Exchange (NYSE)
- 50,000 securities (streams); 100,000 ticks (trade and quote)
- Pairs Trading, a.k.a. Correlation Trading
- Query: which pairs of stocks were correlated with a value of over 0.9 for the last three hours?
8. StatStream (VLDB, 2002): Example
XYZ and ABC have been correlated with a correlation of 0.95 for the last three hours. Now XYZ and ABC become less correlated as XYZ goes up and ABC goes down. They should converge back later. I will sell XYZ and buy ABC.
9. Online Detection of High Correlation
- Given tens of thousands of high-speed time series data streams, detect high-value correlations, both synchronized and time-lagged, over sliding windows in real time.
- Real time
- high update frequency of the data stream
- fixed response time, online
10. Online Detection of High Correlation
11. StatStream Algorithm
- Naive algorithm
- N: number of streams
- w: size of sliding window
- space $O(N)$ and time $O(N^2 w)$ vs. space $O(N^2)$ and time $O(N^2)$.
- Suppose that the streams are updated every second.
- With a Pentium 4 PC, the exact computing method can only monitor 700 streams with a delay of 2 minutes.
- Our Approach
- Use Discrete Fourier Transform to approximate correlation
- Use grid structure to filter out unlikely pairs
- Our approach can monitor 10,000 streams with a delay of 2 minutes.
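The DFT idea above can be illustrated with a minimal sketch. It is not the StatStream implementation: it assumes each window is normalized to zero mean and unit norm, so that correlation equals 1 - d^2/2 for Euclidean distance d, and it keeps only the first m coefficients of a unitary DFT (with m below half the window length). Those few coefficients give a lower bound on d, hence an upper bound on the correlation, so pairs whose bound falls below the query threshold can be dropped without an exact computation. The function names are hypothetical, and the grid structure is not shown.

```python
import cmath
import math

def normalize(x):
    """Shift to zero mean and scale to unit Euclidean norm (x must not be constant)."""
    mean = sum(x) / len(x)
    centered = [v - mean for v in x]
    norm = math.sqrt(sum(v * v for v in centered))
    return [v / norm for v in centered]

def exact_correlation(x, y):
    """Pearson correlation: the dot product of the two normalized series."""
    return sum(a * b for a, b in zip(normalize(x), normalize(y)))

def dft_digest(x, m):
    """Coefficients 1..m of the unitary DFT of the normalized series.

    Coefficient 0 is skipped because it is zero after mean removal.
    """
    xn = normalize(x)
    n = len(xn)
    scale = 1.0 / math.sqrt(n)
    return [
        scale * sum(v * cmath.exp(-2j * math.pi * k * t / n)
                    for t, v in enumerate(xn))
        for k in range(1, m + 1)
    ]

def correlation_upper_bound(dx, dy):
    """Upper bound on the correlation, computed from two digests only.

    For real series each kept coefficient has a conjugate twin, so the
    squared distance is at least twice the partial sum; with
    corr = 1 - d^2 / 2 the bound becomes 1 - partial sum.
    """
    partial = sum(abs(a - b) ** 2 for a, b in zip(dx, dy))
    return 1.0 - partial
```

A pair is kept as a candidate only when `correlation_upper_bound` is at least the query threshold (0.9 in the earlier example); the exact correlation is then computed for the survivors.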
12. Empirical Study: Speed
Our algorithm is parallelizable.
13. Sketches: Random Projection
- Correlation between time series of stock returns
- Since most stock price time series are close to random walks, their return time series are close to white noise
- DFT/DWT can't capture approximate white noise series because there is no clear trend (too many frequency components).
- Solution: Sketches (a form of random landmark)
- Sketch pool: matrix of random variables drawn from a stable distribution
- Sketches: the random projection of all time series to lower dimensions by multiplication with the same matrix
- The Euclidean distance (correlation) between time series is approximated by the distance between their sketches with a probabilistic guarantee.
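As a rough illustration of the sketch idea, the code below projects every series with the same random matrix and compares distances before and after projection. It assumes standard normal entries (the Gaussian is a stable distribution) and a 1/sqrt(d) scaling so that sketch distances are directly comparable to the original ones; the function names and the `seed` parameter are hypothetical, and this is not the authors' sketch pool.

```python
import math
import random

def make_sketch_matrix(n, d, seed=0):
    """d x n matrix of independent standard normal entries."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(d)]

def sketch(x, matrix):
    """Project x (length n) to d dimensions using the shared matrix."""
    d = len(matrix)
    return [sum(r * v for r, v in zip(row, x)) / math.sqrt(d) for row in matrix]

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
```

Because the same matrix is applied to every series, the sketch of a series is computed once and reused for all pairwise comparisons. A larger d tightens the approximation at the cost of longer sketches.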
14. Burst Detection
15. Burst Detection: Applications
- Discovering intervals with unusually large numbers of events.
- In astrophysics, the sky is constantly observed for high-energy particles. When a particular astrophysical event happens, a shower of high-energy particles arrives in addition to the background noise. It might last milliseconds or days.
- In telecommunications, if the number of packets lost within a certain time period exceeds some threshold, it might indicate some network anomaly. Exact duration is unknown.
- In finance, stocks with unusually high trading volumes should attract the notice of traders (or perhaps regulators).
16. Bursts across different window sizes in Gamma Rays
- Challenge: to discover not only the time of the burst, but also the duration of the burst.
17. Elastic Burst Detection: Problem Statement
- Problem: Given a time series of positive numbers $x_1, x_2, \ldots, x_n$ and a threshold function $f(w)$, $w = 1, 2, \ldots, n$, find the subsequences of any size such that their sums are above the thresholds:
- all $0 < w < n$, $0 < m < n - w$, such that $x_m + x_{m+1} + \cdots + x_{m+w-1} \ge f(w)$
- Brute force search: $O(n^2)$ time
- Our shifted wavelet tree (SWT): $O(n + k)$ time.
- $k$ is the size of the output, i.e. the number of windows with bursts
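For reference, the brute-force baseline can be sketched with prefix sums, so that each window sum costs constant time. The sketch assumes 0-based start positions and checks every window that fits in the series, a slightly wider index range than the strict inequalities in the problem statement; the function name is hypothetical.

```python
from itertools import accumulate

def brute_force_bursts(x, f):
    """Return every (start, size) window whose sum reaches f(size).

    start is a 0-based index into x; size is the window length w.
    """
    n = len(x)
    prefix = [0] + list(accumulate(x))  # prefix[i] = x[0] + ... + x[i-1]
    bursts = []
    for w in range(1, n + 1):
        threshold = f(w)
        for m in range(0, n - w + 1):
            if prefix[m + w] - prefix[m] >= threshold:
                bursts.append((m, w))
    return bursts
```

With `x = [1, 5, 1, 1]` and `f = lambda w: 4 + w`, the single large value at position 1 makes every window that contains it a burst, while no window that excludes it qualifies.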
18. Burst Detection: Data Structure and Algorithm
- Define the threshold for a node of size $2^k$ to be the threshold for a window of size $1 + 2^{k-1}$
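The filtering idea behind the shifted wavelet tree can be sketched as follows. This is a simplified reading, not the authors' code: nodes of size 2^k are placed every 2^(k-1) positions, so any window of size at most 1 + 2^(k-1) lies inside some node at that level, and a node whose sum is too small rules out every window it contains (the values are assumed non-negative). As an assumption of this sketch, each level handles the window sizes not covered by smaller levels and triggers a detailed search when the node sum reaches the smallest f(w) among those sizes, so no burst can be missed. The detailed search is a plain scan, so the sketch does not reproduce the stated running time.

```python
from itertools import accumulate

def swt_bursts(x, f):
    """Elastic burst detection with a shifted-window filter.

    Returns the set of (start, size) windows (0-based start) whose sum
    reaches f(size). Assumes every value in x is non-negative.
    """
    n = len(x)
    prefix = [0] + list(accumulate(x))  # prefix[i] = x[0] + ... + x[i-1]

    # Windows of size 1 are checked directly.
    bursts = {(m, 1) for m in range(n) if x[m] >= f(1)}

    covered = 1  # largest window size already handled
    k = 1
    while covered < n:
        node, shift = 2 ** k, 2 ** (k - 1)
        largest = min(shift + 1, n)
        sizes = range(covered + 1, largest + 1)
        node_threshold = min(f(w) for w in sizes)
        for start in range(0, n, shift):
            end = min(start + node, n)  # clip nodes that overrun the series
            if prefix[end] - prefix[start] < node_threshold:
                continue  # no window inside this node can be a burst
            # Detailed search restricted to this node.
            for w in sizes:
                for m in range(start, end - w + 1):
                    if prefix[m + w] - prefix[m] >= f(w):
                        bursts.add((m, w))
        covered = largest
        k += 1
    return bursts
```

A set is returned because neighbouring nodes overlap and may report the same window twice. When bursts are rare, most nodes fail the threshold test and the detailed search is skipped for them.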
19. Empirical Study: Stock Price Spread Burst
20. Elastic Burst in two dimensions
- Population Distribution in the US
21. Summary
- Able to detect bursts of many different durations in essentially linear time.
- Can be used both for time series and for spatial searching.
- Can specify thresholds either with absolute numbers or with probability of hit.
- Algorithm is simple to implement and has low constants (code is available).
- OK, it's embarrassingly simple.
22. With a Little Help From My Warped Correlation
- Karen's humming: Match
- Dennis's humming: Match
- "What would you do if I sang out of tune?"
- Yunyue's humming: Match
23. Related Work in Query by Humming
- Traditional method: String Matching (Ghias et al. 95; McNab et al. 97; Uitdenbogerd and Zobel 99)
- Music represented by string of pitch directions: U, D, S (degenerated interval)
- Hum query is segmented to discrete notes, then string of pitch directions
- Edit Distance between hum query and music score
- Problem
- Very hard to segment the hum query
- Partial solution: users are asked to hum articulately
- New Method: matching directly from audio (Mazzoni and Dannenberg 00)
- Problem
- slowed down by DTW
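The traditional string-matching approach can be illustrated with a small sketch. It assumes the hum has already been segmented into discrete notes (the hard step noted above) and that notes are given as numeric pitches such as MIDI note numbers, which is an assumption made here for illustration; the function names are hypothetical.

```python
def contour(pitches):
    """Encode a note sequence as U (up), D (down), S (same) pitch directions."""
    out = []
    for prev, cur in zip(pitches, pitches[1:]):
        out.append("U" if cur > prev else "D" if cur < prev else "S")
    return "".join(out)

def edit_distance(a, b):
    """Levenshtein distance with unit insert, delete and substitute costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]
```

A hum query would be ranked against each melody by `edit_distance(contour(hum_notes), contour(melody_notes))`, with smaller values meaning closer matches.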
24. Time Series Representation of Query
Segment this!
- An example hum query
- Note segmentation is hard!
25. How to deal with poor hum queries?
- No absolute pitch
- Solution: the average pitch is subtracted
- Incorrect tempo
- Solution: Uniform Time Warping
- Inaccurate pitch intervals
- Solution: return the k-nearest neighbors
- Local timing variations
- Solution: Dynamic Time Warping
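The first two fixes can be sketched in a few lines. The code assumes a hum is represented as a list of pitch samples over time, and it implements uniform time warping as resampling to a common length by linear interpolation; that interpolation choice and the function names are assumptions of this sketch, not details from the slides.

```python
def subtract_mean(pitch):
    """Remove the average pitch so that transposed hums become comparable."""
    mean = sum(pitch) / len(pitch)
    return [p - mean for p in pitch]

def uniform_time_warp(series, length):
    """Stretch or squeeze series to `length` samples by linear interpolation."""
    if length == 1 or len(series) == 1:
        return [series[0]] * length
    step = (len(series) - 1) / (length - 1)
    out = []
    for i in range(length):
        pos = i * step
        lo = min(int(pos), len(series) - 2)  # left neighbour of pos
        frac = pos - lo
        out.append(series[lo] * (1 - frac) + series[lo + 1] * frac)
    return out
```

After both steps, a hum sung in a different key and at a different overall tempo lines up sample by sample with the stored melody. Local timing errors remain and are left to Dynamic Time Warping.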
26. Dynamic Time Warping
- Euclidean distance: sum of point-by-point distance
- DTW distance: allowing stretching or squeezing the time axis locally
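The contrast between the two distances can be made concrete with a minimal sketch. It uses the textbook quadratic-time dynamic program with squared local cost and no warping-window constraint; these choices and the function names are assumptions of this sketch rather than details from the slides.

```python
import math

def euclidean_distance(a, b):
    """Point-by-point distance; both series must have the same length."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def dtw_distance(a, b):
    """Dynamic time warping distance between two series.

    prev[j] holds the best cumulative cost of aligning the elements of a
    processed so far with the first j elements of b.
    """
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)
    for x in a:
        cur = [inf]
        for j, y in enumerate(b, start=1):
            cost = (x - y) ** 2
            cur.append(cost + min(prev[j],       # x repeats the match with y
                                  cur[j - 1],    # y repeats the match with x
                                  prev[j - 1]))  # both series advance
        prev = cur
    return math.sqrt(prev[-1])
```

For a melody and a copy of it in which one note is held slightly longer, `dtw_distance` is zero while the point-by-point distance is not, which is why DTW tolerates local timing variations in a hum.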
27. Dynamic Time Warping
28. AQuery: A Database System for Order
- Dennis Shasha (joint work with Alberto Lerner)
- lerner,shasha_at_cs.nyu.edu
29. Idea
- Whatever can be done on a table can be done on an ordered table (arrable). Not vice-versa.
- AQuery: query language on arrables
- Expresses many queries easily
- Elegant new optimizations.
30. And Streams?
- AQuery has no special facilities for streaming data, but it is expressive enough.
- Idea for streaming data is to split each table into an indexed table holding old data and a buffer table holding recent data.
- Optimizer works over both transparently.
31. Computational Biology
- Collaborations with several groups at NYU (plant and worm), Duke, Yale.
- Growth area: biologists need us, but we have a lot to learn.
- Big issues: control experimental space, evaluate data, infer an active (rather than just paper) model, combinatorial design.
- Visualization.
32. Sungear Design
- Generalizes Venn diagrams to more than three factors
- Visual outline is an ellipse having anchors on borders and vessels in the interior.
- Each vessel points to associated anchors.
- Linked views to hierarchies, lists, and graphs, so can simultaneously update data depending on user queries (selection events).
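One way to picture the anchor and vessel structure is the grouping sketch below. It assumes that each anchor stands for one factor (a named set) and that a vessel collects the items belonging to exactly the same combination of anchors; that reading, the function name, and the example labels are assumptions made here for illustration, not the Sungear implementation.

```python
from collections import defaultdict

def build_vessels(memberships):
    """Group items by the exact set of anchors they belong to.

    memberships maps an item to the collection of anchor names it is in.
    Returns a dict from frozenset of anchors to the sorted list of items.
    """
    vessels = defaultdict(list)
    for item, anchors in memberships.items():
        vessels[frozenset(anchors)].append(item)
    return {anchors: sorted(items) for anchors, items in vessels.items()}

# Hypothetical example: four experiments (anchors) and five genes (items).
example = {
    "g1": ["exp1", "exp2"],
    "g2": ["exp1", "exp2"],
    "g3": ["exp3"],
    "g4": ["exp1", "exp2", "exp3", "exp4"],
    "g5": ["exp3"],
}
vessels = build_vessels(example)
```

Only combinations that actually occur produce a vessel, so the display stays manageable even when the number of anchors grows beyond what a Venn diagram can show.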
33. Venn Diagram: great for three factors
35. Sungear Principle
- Sungear is stupid
- Doesn't care which kind of data it is representing, though there is built-in support for genes (because of links to GO and to Cytoscape).
- Basic Sungear representation could be used to describe anything from yachting gear to demographics.
36. Summary
- Hard problems with practical motivation.
- Fun algorithms -- not afraid of heuristics.
- Fast, maintainable, portable applications.