A Research Sampler

Provided by: dennis47
Learn more at: https://cs.nyu.edu
Transcript and Presenter's Notes

Title: A Research Sampler


1
A Research Sampler
  • shasha@cs.nyu.edu
  • http://cs.nyu.edu/cs/faculty/shasha/index.html

2
Philosophy
  • Research should be fun -- good puzzles,
    interesting algorithms.
  • Research should be useful -- work with real users
    whenever possible.
  • Implementation should be fast (I use a very
    powerful programming environment that I expect my
    students to learn)

3
Thesis Philosophy
  • Ideal thesis should have an interesting algorithm
    with analysis, an implementation, and users.
  • Of the 15 theses I have supervised, 13 follow
    this model. The other two were pure systems
    theses.

4
Current Research Topics
  • Time series analysis: finding correlation/bursts.
    Query by humming.
  • AQuery: a database for ordered data (like time
    series)
  • Computational biology: data analysis,
    visualization, proteomics

5
Online Pattern Discovery
  • Sensor-less: pairs trading in stock trading (find
    highly correlated pairs in n log n time)
  • Sensor-full: gamma ray detection in astrophysics
    (burst detection over a large number of window
    sizes in almost linear time)
  • Dennis Shasha (joint work with Yunyue Zhu,
    Xiaojian Zhao, Zhihua Wang, Tyler Neylon, Xin
    Zhang and Prof. Richard Cole)

6
Goal of this work
  • Time series are important in so many applications:
    biology, medicine, finance, music, physics, ...
  • A few fundamental operations occur all the time:
    burst detection, correlation, pattern matching.
  • Extend functionality for music and science.

7
StatStream (VLDB, 2002): Example
  • Stock price streams
  • The New York Stock Exchange (NYSE)
  • 50,000 securities (streams); 100,000 ticks (trade
    and quote)
  • Pairs trading, a.k.a. correlation trading
  • Query: which pairs of stocks were correlated with
    a value of over 0.9 for the last three hours?
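
The query above can be read as an all-pairs Pearson correlation over the most recent window. The following is a minimal sketch of that naive reading, not the StatStream method. The stream names, the sample values and the `window` parameter (a number of samples standing in for "three hours") are illustrative assumptions.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # constant series: correlation undefined, treated as 0 here
    return cov / math.sqrt(vx * vy)


def correlated_pairs(streams, window, threshold=0.9):
    """Pairs whose last `window` values correlate above `threshold`."""
    recent = {name: values[-window:] for name, values in streams.items()}
    result = []
    for a, b in combinations(sorted(recent), 2):  # every pair: O(N^2 * w) work
        c = pearson(recent[a], recent[b])
        if c > threshold:
            result.append((a, b, c))
    return result


# Hypothetical price streams, for illustration only.
streams = {
    "XYZ": [10.0, 10.5, 11.0, 10.8, 11.5, 12.0],
    "ABC": [20.0, 21.1, 22.0, 21.5, 23.0, 24.1],
    "QRS": [5.0, 4.0, 5.5, 3.9, 5.2, 4.1],
}
for a, b, c in correlated_pairs(streams, window=6):
    print(a, b, round(c, 3))
```

Every pair is recomputed from scratch here, which is the cost that motivates the approximations on the later slides.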

8
StatStream (VLDB, 2002): Example
XYZ and ABC have been correlated with a
correlation of 0.95 for the last three hours. Now
XYZ and ABC become less correlated as XYZ goes up
and ABC goes down. They should converge back
later. I will sell XYZ and buy ABC.
9
Online Detection of High Correlation
  • Given tens of thousands of high-speed time series
    data streams, detect high-value correlations,
    both synchronized and time-lagged, over
    sliding windows in real time.
  • Real time:
  • high update frequency of the data stream
  • fixed response time, online

10
Online Detection of High Correlation
11
StatStream Algorithm
  • Naive algorithm
  • N: number of streams
  • w: size of sliding window
  • Space $O(N)$ and time $O(N^2 w)$, versus space $O(N^2)$ and
    time $O(N^2)$.
  • Suppose that the streams are updated every
    second.
  • With a Pentium 4 PC, the exact computing method
    can only monitor 700 streams with a delay of 2
    minutes.
  • Our approach
  • Use the Discrete Fourier Transform to approximate
    correlation
  • Use a grid structure to filter out unlikely pairs
  • Our approach can monitor 10,000 streams with a
    delay of 2 minutes.
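
One way to see why a few DFT coefficients can stand in for the correlation: after each window is shifted to zero mean and scaled to unit norm, the correlation equals 1 - d^2/2, where d is the Euclidean distance between the two windows, and a truncated DFT can only underestimate d. The sketch below illustrates that bound alone. It leaves out the incremental digests and the grid filter, and the function names and test series are assumptions made for illustration.

```python
import cmath
import math


def normalize(x):
    """Shift to zero mean and scale to unit Euclidean norm."""
    mean = sum(x) / len(x)
    centered = [v - mean for v in x]
    norm = math.sqrt(sum(v * v for v in centered))
    return [v / norm for v in centered]


def dft_prefix(x, m):
    """Coefficients 1..m of the unitary DFT (coefficient 0 is 0 for zero-mean x)."""
    n = len(x)
    scale = 1 / math.sqrt(n)
    return [
        scale * sum(v * cmath.exp(-2j * math.pi * k * t / n) for t, v in enumerate(x))
        for k in range(1, m + 1)
    ]


def corr_upper_bound(cx, cy):
    """Upper bound on the correlation, from truncated DFT coefficients.

    For real input, coefficient n-k mirrors coefficient k, so each kept
    coefficient is counted twice (this requires m < n / 2).
    """
    d2 = 2 * sum(abs(a - b) ** 2 for a, b in zip(cx, cy))
    return 1 - d2 / 2


# Two similar series, chosen only for illustration.
xs = normalize([math.sin(0.3 * t) + 0.1 * t for t in range(32)])
ys = normalize([math.sin(0.3 * t + 0.2) + 0.1 * t for t in range(32)])
exact = sum(a * b for a, b in zip(xs, ys))
bound = corr_upper_bound(dft_prefix(xs, 4), dft_prefix(ys, 4))
print("exact correlation:", exact, "upper bound from 4 coefficients:", bound)
```

A pair whose upper bound already falls below the query threshold can be discarded without computing the exact correlation, so the filter produces no false negatives.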

12
Empirical Study: Speed
Our algorithm is parallelizable.
13
Sketches: Random Projection
  • Correlation between time series of the returns of
    stock
  • Since most stock price time series are close to
    random walks, their return time series are close
    to white noise
  • DFT/DWT can't capture approximate white noise
    series because there is no clear trend (too many
    frequency components).
  • Solution: sketches (a form of random landmark)
  • Sketch pool: a matrix of random variables drawn
    from a stable distribution
  • Sketches: the random projection of all time
    series to lower dimensions by multiplication with
    the same matrix
  • The Euclidean distance (correlation) between time
    series is approximated by the distance between
    their sketches with a probabilistic guarantee.
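
A small sketch of the random projection idea, assuming Gaussian entries (the Gaussian is the 2-stable distribution, which suits Euclidean distance). The sketch size, series length and seeds are arbitrary illustrative choices, not values from the talk.

```python
import math
import random


def make_projection(sketch_size, length, seed=0):
    """Random matrix shared by all series: sketch_size rows of Gaussian entries."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(length)] for _ in range(sketch_size)]


def sketch(series, projection):
    """Project a series to len(projection) dimensions, scaled to preserve distances."""
    k = len(projection)
    return [sum(r * v for r, v in zip(row, series)) / math.sqrt(k) for row in projection]


def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


# Two white-noise-like series of 256 points, reduced to 128-dimensional sketches.
rng = random.Random(1)
a = [rng.gauss(0.0, 1.0) for _ in range(256)]
b = [rng.gauss(0.0, 1.0) for _ in range(256)]
proj = make_projection(128, 256, seed=7)
print("true distance:  ", euclidean(a, b))
print("sketch distance:", euclidean(sketch(a, proj), sketch(b, proj)))
```

The two printed distances should be close but not equal. The guarantee is probabilistic, and it tightens as the sketch size grows.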

14
Burst Detection
15
Burst Detection: Applications
  • Discovering intervals with unusually large
    numbers of events.
  • In astrophysics, the sky is constantly observed
    for high-energy particles. When a particular
    astrophysical event happens, a shower of
    high-energy particles arrives in addition to the
    background noise. It might last milliseconds or
    days.
  • In telecommunications, if the number of packets
    lost within a certain time period exceeds some
    threshold, it might indicate some network
    anomaly. The exact duration is unknown.
  • In finance, stocks with unusually high trading
    volumes should attract the notice of traders (or
    perhaps regulators).

16
Bursts across different window sizes in Gamma Rays
  • Challenge: to discover not only the time of the
    burst, but also the duration of the burst.

17
Elastic Burst Detection: Problem Statement
  • Problem: given a time series of positive numbers
    $x_1, x_2, \ldots, x_n$ and a threshold function $f(w)$,
    $w = 1, 2, \ldots, n$, find the subsequences of any size
    whose sums are above the thresholds:
  • all $0 < w < n$, $0 < m < n - w$, such that
    $x_m + x_{m+1} + \cdots + x_{m+w-1} \ge f(w)$
  • Brute force search: $O(n^2)$ time
  • Our shifted wavelet tree (SWT): $O(n + k)$ time.
  • k is the size of the output, i.e. the number of
    windows with bursts
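
For reference, the brute-force search can be written with prefix sums so that each window sum costs constant time. This is a baseline sketch only. It uses 0-based start positions and checks every window size from 1 to n, which differs slightly from the index ranges on the slide, and the sample data and threshold function are made up.

```python
from itertools import accumulate


def bursts_brute_force(x, f):
    """All (start, w) with sum(x[start:start + w]) >= f(w); quadratic in len(x)."""
    n = len(x)
    prefix = [0] + list(accumulate(x))  # prefix[i] = x[0] + ... + x[i-1]
    found = []
    for w in range(1, n + 1):
        threshold = f(w)
        for start in range(n - w + 1):
            if prefix[start + w] - prefix[start] >= threshold:
                found.append((start, w))
    return found


# Illustrative data: a short burst in the middle of background noise.
counts = [1, 1, 1, 9, 8, 1, 1, 1]
print(bursts_brute_force(counts, lambda w: 6 * w + 2))
# prints [(3, 1), (4, 1), (3, 2)]
```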

18
Burst Detection: Data Structure and Algorithm
  • Define the threshold for a node of size $2^k$ to be the
    threshold for a window of size $1 + 2^{k-1}$
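
The filtering idea can be sketched as follows, assuming non-negative data and a non-decreasing threshold function. At level k the nodes have size 2^k and are shifted by half their size, so every window of size at most 1 + 2^(k-1) lies inside some node. A node whose sum stays below the alarm threshold cannot contain a burst and is skipped. Here the alarm is the threshold of the smallest window size the level is responsible for. That is one safe choice and may differ in detail from the definition on the slide. The function name and sample data are illustrative.

```python
from itertools import accumulate


def bursts_swt(x, f):
    """Elastic burst detection with a shifted-wavelet-tree style filter.

    Assumes x is non-negative and f is non-decreasing in w.
    Returns a sorted list of (start, w) with sum(x[start:start + w]) >= f(w).
    """
    n = len(x)
    prefix = [0] + list(accumulate(x))
    found = set()
    level, lo = 1, 1  # lo: smallest window size handled at this level
    while lo <= n:
        size, shift = 2 ** level, 2 ** (level - 1)
        hi = min(shift + 1, n)  # largest window size handled at this level
        alarm = f(lo)  # lowest threshold among sizes lo..hi
        for a in range(0, n, shift):  # overlapping nodes, shifted by half a node
            end = min(a + size, n)
            if prefix[end] - prefix[a] < alarm:
                continue  # no window inside this node can reach its threshold
            for w in range(lo, hi + 1):  # detailed search inside the alarmed node
                for s in range(a, end - w + 1):
                    if prefix[s + w] - prefix[s] >= f(w):
                        found.add((s, w))
        level += 1
        lo = 2 ** (level - 2) + 2  # next level starts right after the previous hi
    return sorted(found)


counts = [1, 1, 1, 9, 8, 1, 1, 1]
print(bursts_swt(counts, lambda w: 6 * w + 2))
# prints [(3, 1), (3, 2), (4, 1)]
```

When bursts are rare, most nodes fail the alarm test and the detailed search runs on only a few of them, which is where the savings over brute force come from.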

19
Empirical Study: Stock Price Spread Burst
20
Elastic Burst in two dimensions
  • Population Distribution in the US

21
Summary
  • Able to detect bursts of many different durations
    in essentially linear time.
  • Can be used both for time series and for spatial
    searching.
  • Can specify thresholds either with absolute
    numbers or with probability of hit.
  • Algorithm is simple to implement and has low
    constants (code is available).
  • OK, it's embarrassingly simple.

22
With a Little Help From My Warped Correlation
  • Karen's humming: Match
  • Dennis's humming: Match
  • "What would you do if I sang out of tune?"
  • Yunyue's humming: Match

23
Related Work in Query by Humming
  • Traditional method: string matching
    [Ghias et al. 95, McNab
    et al. 97, Uitdenbogerd and Zobel 99]
  • Music represented by string of pitch directions
    U, D, S (degenerated interval)
  • Hum query is segmented to discrete notes, then
    string of pitch directions
  • Edit distance between hum query and music score
  • Problem:
  • Very hard to segment the hum query
  • Partial solution: users are asked to hum
    articulately
  • New method: matching directly from audio
    [Mazzoni and Dannenberg 00]
  • Problem:
  • slowed down by DTW
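
The traditional string-matching approach can be sketched in a few lines: notes become a string of U (up), D (down) and S (same), and strings are compared by edit distance. The sketch assumes the hum has already been segmented into notes, which the slide identifies as the hard part. The `tolerance` parameter and the example pitches (MIDI-style note numbers) are assumptions made for illustration.

```python
def contour(pitches, tolerance=0.5):
    """Map a note sequence to a string of pitch directions U, D, S."""
    out = []
    for prev, cur in zip(pitches, pitches[1:]):
        diff = cur - prev
        if abs(diff) <= tolerance:
            out.append("S")
        else:
            out.append("U" if diff > 0 else "D")
    return "".join(out)


def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


score = [60, 60, 62, 60, 65, 64]            # notes of a stored melody
hum = [57.2, 57.0, 59.1, 57.3, 61.8, 61.0]  # same tune, hummed lower and off pitch
print(contour(score), contour(hum), edit_distance(contour(score), contour(hum)))
# prints SUDUD SUDUD 0
```

The contour discards the interval sizes, so a transposed or slightly out-of-tune hum still matches, at the cost of many distinct melodies sharing one string.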

24
Time Series Representation of Query
Segment this!
  • An example hum query
  • Note segmentation is hard!

25
How to deal with poor hum queries?
  • No absolute pitch
  • Solution: the average pitch is subtracted
  • Incorrect tempo
  • Solution: Uniform Time Warping
  • Inaccurate pitch intervals
  • Solution: return the k-nearest neighbors
  • Local timing variations
  • Solution: Dynamic Time Warping
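
The first two fixes are simple enough to sketch directly: subtracting the mean removes the dependence on absolute pitch, and stretching every series to a common length removes a global tempo difference. The resampling rule below (repeat or skip samples uniformly) is an assumed simplification of uniform time warping, and the target length of 8 is arbitrary.

```python
def subtract_mean(pitch):
    """Remove the average pitch so only the relative melody shape remains."""
    mean = sum(pitch) / len(pitch)
    return [p - mean for p in pitch]


def uniform_warp(series, length):
    """Stretch or squeeze a series to `length` samples, uniformly in time."""
    n = len(series)
    return [series[i * n // length] for i in range(length)]


def normalized(pitch, length=8):
    return uniform_warp(subtract_mean(pitch), length)


melody = [60, 62, 64, 62]                # stored tune, one sample per note
hum = [55, 55, 57, 57, 59, 59, 57, 57]   # five semitones lower, half the tempo
print(normalized(melody))
print(normalized(hum))
# both print [-2.0, -2.0, 0.0, 0.0, 2.0, 2.0, 0.0, 0.0]
```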

26
Dynamic Time Warping
  • Euclidean distance: sum of point-by-point
    distance
  • DTW distance: allows stretching or squeezing
    the time axis locally
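
The contrast between the two distances can be seen with the textbook dynamic program for DTW. This is a plain, unconstrained version with squared point costs, given as a sketch. It uses no warping-width constraint or lower-bound pruning, and the example series are made up.

```python
import math


def euclidean(x, y):
    """Point-by-point distance; requires equal lengths."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def dtw(x, y):
    """Dynamic time warping distance (square root of the minimal summed squared cost)."""
    n, m = len(x), len(y)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (x[i - 1] - y[j - 1]) ** 2
            # Extend the cheapest of: repeat y[j-1], repeat x[i-1], or advance both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return math.sqrt(cost[n][m])


x = [0, 0, 1, 2, 1, 0]
y = [0, 1, 2, 2, 1, 0]  # same shape, with the rise shifted one step earlier
print(euclidean(x, y), dtw(x, y))
# prints 1.4142135623730951 0.0
```

The table has n times m cells, which is the cost behind the "slowed down by DTW" remark.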

27
Dynamic Time Warping
28
AQuery: A Database System for Order
  • Dennis Shasha (joint work with Alberto Lerner)
  • {lerner,shasha}@cs.nyu.edu

29
Idea
  • Whatever can be done on a table can be done on an
    ordered table (arrable). Not vice-versa.
  • AQuery: a query language on arrables
  • Expresses many queries easily
  • Elegant new optimizations.

30
And Streams?
  • AQuery has no special facilities for streaming
    data, but it is expressive enough.
  • Idea for streaming data: split each table into
    an indexed table holding old data and a
    buffer table holding recent data.
  • Optimizer works over both transparently.

31
Computational Biology
  • Collaborations with several groups at NYU (plant
    and worm), Duke, Yale.
  • Growth area: biologists need us, but we have a
    lot to learn.
  • Big issues: control experimental space, evaluate
    data, infer an active (rather than just paper)
    model; combinatorial design.
  • Visualization.

32
Sungear Design
  • Generalizes Venn diagrams to more than three
    factors
  • Visual outline is an ellipse having anchors on
    borders and vessels in the interior.
  • Each vessel points to associated anchors.
  • Linked views to hierarchies, lists, and graphs,
    so all views can update simultaneously in
    response to user queries (selection events).
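
One plausible reading of the anchor and vessel structure, sketched as a data structure: each anchor is a named set of items, and each vessel collects the items that belong to exactly the same combination of anchors. This is an assumption about the representation, not the Sungear code, and the anchor and gene names are invented.

```python
from collections import defaultdict


def build_vessels(memberships):
    """Group items by the exact set of anchors they belong to.

    memberships: dict mapping anchor name -> set of items.
    Returns a dict mapping frozenset(anchor names) -> set of items (one vessel each).
    """
    anchors_of = defaultdict(set)
    for anchor, items in memberships.items():
        for item in items:
            anchors_of[item].add(anchor)
    vessels = defaultdict(set)
    for item, anchors in anchors_of.items():
        vessels[frozenset(anchors)].add(item)
    return dict(vessels)


# Hypothetical experiments (anchors) and the genes each one flags.
memberships = {
    "A": {"g1", "g2", "g3"},
    "B": {"g2", "g3", "g4"},
    "C": {"g3"},
}
for anchors, items in sorted(build_vessels(memberships).items(),
                             key=lambda kv: sorted(kv[0])):
    print(sorted(anchors), sorted(items))
```

Only combinations that actually occur become vessels, so the display stays manageable even though the number of possible combinations grows exponentially with the number of anchors.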

33
Venn Diagram: great for three factors
34
(No Transcript)
35
Sungear Principle
  • Sungear is stupid
  • Doesn't care which kind of data it is
    representing, though there is built-in support
    for genes (because of links to GO and to
    Cytoscape).
  • Basic Sungear representation could be used to
    describe anything from yachting gear to
    demographics.

36
Summary
  • Hard problems with practical motivation.
  • Fun algorithms; not afraid of heuristics.
  • Fast, maintainable, portable applications.