Data Matters - PowerPoint PPT Presentation

About This Presentation

Title: Data Matters
Author: Bernard Chazelle
Last modified by: Bernard
Created Date: 10/8/2004 10:17:12 PM
Document presentation format: On-screen Show

Number of Views:88
Avg rating:3.0/5.0
Slides: 58
Provided by: Berna121
Category:

less

Transcript and Presenter's Notes

Title: Data Matters


1
  • Sublinear Algorithms

2
Sloan Digital Sky Survey: 4 petabytes, 10 petabytes/yr
Biomedical imaging: 150 petabytes/yr
3
  • Data

4
  • Data

5
Sublinear algorithms: sample a tiny fraction of the massive input to compute the output
6
(No Transcript)
7
Approximate MST [CRT 01]
  • Optimal!

8
Reduces to counting connected components
9
E[estimator] = no. of connected components
Var[estimator] << (no. of connected components)^2
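The reduction above suggests a simple sampling estimator. Here is a minimal Python sketch, not the CRT 01 algorithm itself; the graph representation, sample count, and BFS cap are illustrative assumptions. It uses the identity that the number of connected components equals the sum over vertices of 1/|C(v)|, estimated from a few random vertices with a truncated BFS:

import random
from collections import deque

def estimate_components(graph, n, samples=1000, cap=100):
    """Sublinear estimate of the number of connected components.
    graph: adjacency dict {vertex: list of neighbours}, vertices 0..n-1.
    Uses the identity  #components = sum_v 1/|C(v)|,  estimated from a few
    sampled vertices; BFS is truncated at `cap`, so components larger than
    `cap` contribute ~0 (a small, controllable bias)."""
    total = 0.0
    for _ in range(samples):
        v = random.randrange(n)
        seen = {v}
        queue = deque([v])
        while queue and len(seen) <= cap:
            u = queue.popleft()
            for w in graph.get(u, ()):
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        if len(seen) <= cap:            # small component fully explored
            total += 1.0 / len(seen)
    return n * total / samples          # expectation ~ number of components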
10
Shortest Paths [CLM 03]
11
Ray Shooting [CLM 03]
  • Optimal!
  • Volume
  • Intersection
  • Point location

12
  • Self-Improving Algorithms

13
  • 01101011011010101011001010101011010011100110101001
    0100010

low-entropy data
  • Takens embeddings
  • Markov models (speech)

14
Self-Improving Algorithms
Arbitrary, unknown random source
Sorting, Matching, MaxCut, All-pairs shortest paths, Transitive closure, Clustering
15
Self-Improving Algorithms
Arbitrary, unknown random source
1. Run the algorithm with the best worst-case behavior, or the best behavior under a uniform distribution, or the best under some postulated prior.
2. Learning phase: the algorithm fine-tunes itself as it learns about the random source through repeated use.
3. The algorithm settles to a stationary state: optimal expected complexity under the (still unknown) random source (see the skeleton sketched below).
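A minimal Python skeleton of these three phases, only to illustrate the control flow; the default_solver and the tuning step are hypothetical placeholders, not any particular self-improving algorithm:

class SelfImprovingAlgorithm:
    """Skeleton of the three phases only; `default_solver` and the tuning
    step are hypothetical placeholders."""

    def __init__(self, default_solver, learn_rounds=1000):
        self.default_solver = default_solver   # phase 1: best worst-case solver
        self.learn_rounds = learn_rounds
        self.samples = []                      # observations of the random source
        self.tuned_solver = None               # phase 3: source-specific solver

    def solve(self, instance):
        if self.tuned_solver is not None:      # phase 3: stationary, tuned
            return self.tuned_solver(instance)
        result = self.default_solver(instance) # phase 1: safe default
        self.samples.append(instance)          # phase 2: learn the source
        if len(self.samples) >= self.learn_rounds:
            self.tuned_solver = self._build_tuned_solver(self.samples)
        return result

    def _build_tuned_solver(self, samples):
        # Placeholder: build source-specific structures (e.g. search trees
        # tuned to the observed distribution) from the training samples.
        return self.default_solver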
16
Self-Improving Algorithms
  • 0110101100101000101001001010100010101001

time T1, time T2, time T3, time T4, time T5
E[Tk] → optimal expected time for the random source
17
Sorting
(x1, x2, …, xn)
  • each xi drawn independently from a distribution Di
  • H = entropy of the rank distribution (sorter sketched below)
  • Optimal!
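A hedged sketch of the self-improving sorter in Python: during the learning phase a sample of previously seen values becomes bucket boundaries; afterwards each xi is located among the boundaries and the buckets are sorted individually. The real algorithm replaces the plain binary search with per-coordinate search structures tuned to the Di, which is what yields the entropy bound; the names and parameters below are illustrative only:

import bisect
import random

def build_sorter(training_inputs):
    """training_inputs: a list of past input tuples (x_1, ..., x_n)."""
    # Learning phase: keep a sorted sample of ~n observed values as boundaries.
    pool = [x for inp in training_inputs for x in inp]
    n = len(training_inputs[0])
    boundaries = sorted(random.sample(pool, n))

    def sort(xs):
        buckets = [[] for _ in range(n + 1)]
        for x in xs:
            # Plain binary search here; the real algorithm uses a search
            # structure tuned to each D_i to reach entropy-optimal time.
            j = bisect.bisect_left(boundaries, x)
            buckets[j].append(x)
        out = []
        for b in buckets:
            b.sort()                   # buckets have O(1) expected size
            out.extend(b)
        return out

    return sort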

18
Clustering
K-median (k = 2)
19
Hamming cube {0,1}^d
Minimize sum of distances
20
Hamming cube {0,1}^d
Minimize sum of distances
  • NP-hard

21
Hamming cube {0,1}^d
Minimize sum of distances
[KSS]
22
How to achieve linear limiting expected time?
Input space {0,1}^(dn)
Identify the core
Tail: use KSS, with prob < O(dn) / (KSS running time)
23
How to achieve linear limiting expected time?
NP vs P: input vicinity → algorithmic vicinity
Store a sample of precomputed KSS solutions
Nearest neighbor
Incremental algorithm (sketched below)
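A rough Python sketch of the core/tail idea from the last two slides, with hypothetical helpers kss_solve (a full KSS run) and distance: clusterings are precomputed for a sample of inputs, a new input reuses the nearest stored solution with a cheap incremental fix-up, and the rare tail inputs fall back to a full KSS run so that its cost is paid only with small probability:

def build_clusterer(kss_solve, distance, training_inputs):
    """kss_solve and distance are hypothetical helpers: a full KSS run and a
    distance between inputs (e.g. Hamming distance on the point sets)."""
    # Learning phase: precompute KSS clusterings for a sample of inputs.
    store = [(inp, kss_solve(inp)) for inp in training_inputs]

    def cluster(new_input, tail_threshold):
        nearest, solution = min(store, key=lambda entry: distance(new_input, entry[0]))
        if distance(new_input, nearest) > tail_threshold:
            return kss_solve(new_input)      # tail: pay the full KSS cost (rare)
        return refine(solution, new_input)   # core: cheap incremental fix-up

    def refine(solution, new_input):
        # Placeholder for the incremental step (e.g. reassign points to the
        # precomputed centers and adjust them locally).
        return solution

    return cluster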
24
Main difficulty: how to spot the tail?
25
  • Online Data
  • Reconstruction

26
  • 01101011011010101011001010101011010011100110101001
    0100010
  • 011010110110101010110010101010100111001001
    0010

1. Data is accessible before noise
2. Or it's not
27
  • 011010110110101010110010101010100111001001
    0010

1. Data is accessible before noise
28
  • 01101011011010101011001010101011010011100110101001
    0100010
  • 010100001

decode
encode
  • error correcting codes
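For the case where data is accessible before the noise, any error-correcting code fits the encode-then-decode picture above. A toy repetition code in Python, chosen purely for illustration (it is not the scheme behind the slide):

def encode(bits, r=3):
    """Repeat every bit r times before the channel adds noise."""
    return [b for b in bits for _ in range(r)]

def decode(received, r=3):
    """Majority vote inside each block of r copies."""
    out = []
    for i in range(0, len(received), r):
        block = received[i:i + r]
        out.append(1 if sum(block) > r // 2 else 0)
    return out

msg = [0, 1, 1, 0]
noisy = encode(msg)
noisy[1] ^= 1                 # the channel flips one bit
assert decode(noisy) == msg   # a single flip per block is corrected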

29
  • 01101011011010101011001010101011010011100110101001
    0100010

Data inaccessible before noise
Assumptions are necessary!
30
  • 01101011011010101011001010101011010011100110101001
    0100010

Data inaccessible before noise
1. Sorted sequence
2. Bipartite graph, expander
3. Solid w/ angular constraints
4. Low dim attractor set
31
  • 01101011011010101011001010101011010011100110101001
    0100010

Data inaccessible before noise
data must satisfy some property P, but does not quite
32
Query f(x)? (f: access function, x: data) returns f(x)
But life being what it is
33
Query f(x)? (x: data) returns f(x)
34
Humans
Define distance from any object to data class
35
  • no undo

The query f(x)? now goes through a filter: the filter queries the data at points x1, x2, …, sees f(x1), f(x2), …, and answers g(x)
g is the access function for the reconstructed data
36
Similar to Self-Correction [RS96, BLR93], except:
  • about data, not functions
  • error-free
  • allows O(distance to property)
37
Monotone function [n]^d → R
Filter requires polylog(n) queries
38
Offline reconstruction
39
Offline reconstruction
40
Online reconstruction
41
Online reconstruction
42
Online reconstruction
  • don't mortgage the future

43
Online reconstruction
  • early decisions are crucial!

44
monotone function
45
(No Transcript)
46
Frequency of a point x:
smallest interval I containing > |I|/2 violations involving f(x)
47
Frequency of a point
48
Given x:
1. estimate its frequency
2. if nonzero, find the smallest interval around x with both endpoints having zero frequency
3. interpolate between f(endpoints) (sketched below)
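A simplified Python sketch of these three steps, assuming a 1-D domain {0, ..., n-1} and an access function f that is supposed to be monotone. The frequency test below samples a few points per interval, and the endpoint search is a naive linear scan rather than the polylog-query procedure the slides claim, so it only illustrates the logic:

import random

def make_monotone_filter(f, n, samples=50):
    """f: noisy access function on {0, ..., n-1}, supposed to be monotone.
    Returns g, an access function whose answers look monotone."""

    def violates(a, b):                 # a < b but f(a) > f(b)
        return f(a) > f(b)

    def frequency_nonzero(x):
        # Crude stand-in for the frequency test: for a few interval sizes
        # around x, sample points and check whether more than half of them
        # violate monotonicity together with f(x).
        length = 2
        while length <= n:
            lo, hi = max(0, x - length), min(n - 1, x + length)
            hits = 0
            for _ in range(samples):
                y = random.randint(lo, hi)
                if (y < x and violates(y, x)) or (y > x and violates(x, y)):
                    hits += 1
            if hits > samples // 2:
                return True
            length *= 2
        return False

    def g(x):
        if not frequency_nonzero(x):    # step 1: zero frequency, keep f(x)
            return f(x)
        # Step 2: widen to the nearest zero-frequency endpoints
        # (a linear scan here; the slides do this with polylog queries).
        lo, hi = x, x
        while lo > 0 and frequency_nonzero(lo):
            lo -= 1
        while hi < n - 1 and frequency_nonzero(hi):
            hi += 1
        # Step 3: interpolate between the endpoint values.
        return (f(lo) + f(hi)) / 2.0

    return g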
49
To prove:
1. Frequencies can be estimated in polylog time
2. The function is monotone over the zero-frequency domain
3. The zero-frequency domain occupies a (1 - 2ε) fraction
50
Bivariate concave function
Filter requires polylog(n) queries
51

bipartite graph, k-connectivity, expander
  • open
52

denoising low-dim attractor sets
  • open


53
  • Priced Computation

54
01001001110101011001000111010010
Priced computation
accuracy
  • spectrometry/cloning/gene chip
  • PCR/hybridization/chromatography
  • gel electrophoresis/blotting
  • Linear programming

55
experimentation
computation
56
Factoring is easy. Here's why
Gaussian mixture sample 00100101001001101010101.
Pricing data
Ongoing project w/ Nir Ailon
57
Collaborators: Nir Ailon, Seshadri Comandur, Ding Liu

