Title: Data Matters
1 2 Sloan Digital Sky Survey
4 petabytes (1MG)
10 petabytes/yr
Biomedical imaging
150 petabytes/yr
3 4 5 massive input
output
Sample tiny fraction
Sublinear algorithms
7 Approximate MST
CRT 01
8 Reduces to counting connected components
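One standard way to state this reduction, written out as a worked identity. The symbols are assumptions introduced here for illustration and do not appear on the slide: G is a connected graph with n vertices and integer edge weights in {1, ..., w}, M(G) is the weight of its minimum spanning tree, and c^{(i)} is the number of connected components of the subgraph that keeps only edges of weight at most i.

```latex
M(G) \;=\; n - w + \sum_{i=1}^{w-1} c^{(i)}
```

Sanity check: if every edge has weight 1, then w = 1, the sum is empty, and M(G) = n - 1, the weight of any spanning tree.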
9 $\mathbb{E}[\text{estimate}] =$ no. connected components
$\mathrm{var} \ll (\text{no. connected components})^2$
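A minimal sketch of how such an estimator can work, assuming the usual sampling idea: the number of components equals the sum over all vertices v of 1/(size of v's component), so sampling a few vertices and exploring each component only up to a cap gives an estimate. The function names, the adjacency-list input, and the cap of ceil(2/eps) are illustrative assumptions, not taken from the slides.

```python
import math
import random
from collections import deque


def truncated_component_size(adj, start, cap):
    """BFS from start, stopping once cap vertices are seen.

    Returns min(size of start's component, cap).
    """
    seen = {start}
    queue = deque([start])
    while queue and len(seen) < cap:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
                if len(seen) >= cap:
                    break
    return len(seen)


def estimate_components(adj, eps, num_samples, seed=0):
    """Estimate the number of connected components from a vertex sample.

    adj is a list of adjacency lists (an assumption of this sketch).
    Capping each exploration at about 2/eps vertices keeps every probe
    cheap; the cap can only inflate the estimate slightly.
    """
    rng = random.Random(seed)
    n = len(adj)
    cap = max(1, math.ceil(2 / eps))
    total = 0.0
    for _ in range(num_samples):
        u = rng.randrange(n)
        total += 1.0 / truncated_component_size(adj, u, cap)
    return n * total / num_samples
```

The work per sample depends on the cap rather than on the graph size, which is what makes the approach sublinear when few samples suffice.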
10 Shortest Paths
CLM 03
11 Ray Shooting
CLM 03
- Volume
- Intersection
- Point location
12 13 - 011010110110101010110010101010110100111001101010010100010
low-entropy data
- Takens embeddings
- Markov models (speech)
14 Self-Improving Algorithms
Arbitrary, unknown random source
Sorting, Matching, MaxCut, All-pairs shortest paths, Transitive closure, Clustering
15 Self-Improving Algorithms
Arbitrary, unknown random source
1. Run algorithm for best worst-case behavior, or best under uniform distribution, or best under some postulated prior.
2. Learning phase: algorithm fine-tunes itself as it learns about the random source through repeated use.
3. Algorithm settles into a stationary state: optimal expected complexity under the (still unknown) random source.
16 Self-Improving Algorithms
- 0110101100101000101001001010100010101001
time $T_1$
time $T_2$
time $T_3$
time $T_4$
time $T_5$
$\mathbb{E}[T_k] \to$ optimal expected time for random source
17 Sorting
$(x_1, x_2, \ldots, x_n)$
- each $x_i$ drawn independently from $D_i$
- $H$ = entropy of rank distribution
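To make the setting concrete, here is a deliberately simplified sketch of a sorter that tunes itself to the per-position distributions. It is an illustration under stated assumptions, not the algorithm from the talk: the class name, the `top_guesses` parameter, and the "most frequent bucket first" lookup are hypothetical stand-ins for the entropy-sensitive search structures a real self-improving sorter would build.

```python
import bisect
from collections import Counter


class SelfImprovingSorter:
    """Toy self-improving sorter (hypothetical illustration)."""

    def __init__(self, training_inputs, top_guesses=2):
        n = len(training_inputs[0])
        # Learning phase, part 1: choose about n bucket boundaries
        # from the pooled training values.
        pooled = sorted(x for sample in training_inputs for x in sample)
        step = max(1, len(pooled) // n)
        self.boundaries = sorted(set(pooled[step - 1::step]))
        # Learning phase, part 2: for each position i, remember the
        # buckets that x_i landed in most often.
        self.guesses = []
        for i in range(n):
            counts = Counter(self._locate(s[i]) for s in training_inputs)
            self.guesses.append([b for b, _ in counts.most_common(top_guesses)])

    def _locate(self, x):
        # Bucket b holds values with boundaries[b-1] < x <= boundaries[b].
        return bisect.bisect_left(self.boundaries, x)

    def _in_bucket(self, x, b):
        lo_ok = b == 0 or self.boundaries[b - 1] < x
        hi_ok = b == len(self.boundaries) or x <= self.boundaries[b]
        return lo_ok and hi_ok

    def sort(self, xs):
        """Sort an input with the same number of positions as the training inputs."""
        buckets = [[] for _ in range(len(self.boundaries) + 1)]
        for i, x in enumerate(xs):
            for b in self.guesses[i]:
                if self._in_bucket(x, b):   # cheap check of a learned guess
                    break
            else:
                b = self._locate(x)          # fall back to binary search
            buckets[b].append(x)
        out = []
        for bucket in buckets:
            bucket.sort()                    # buckets are expected to be small
            out.extend(bucket)
        return out
```

The output is correct for any input; only the running time depends on how well the learned guesses match the source. When each position's distribution is concentrated (low entropy), most elements are placed with a constant number of comparisons.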
18 Clustering
K-median (k = 2)
19 Minimize sum of distances
Hamming cube $\{0,1\}^d$
20 Minimize sum of distances
Hamming cube $\{0,1\}^d$
21 Minimize sum of distances
Hamming cube $\{0,1\}^d$
KSS
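For readers who want the objective spelled out, the sketch below evaluates the 2-median cost in the Hamming cube and improves a pair of centers with a simple alternating heuristic. This is not the KSS algorithm referenced on the slide. It is a Lloyd-style local search swapped in for illustration, and all function names are hypothetical. It relies on one certain fact: for a fixed cluster, the coordinate-wise majority vote minimizes the sum of Hamming distances.

```python
def hamming(a, b):
    """Hamming distance between two equal-length 0/1 tuples."""
    return sum(x != y for x, y in zip(a, b))


def majority_center(points):
    """Coordinate-wise majority: the best single center for this cluster."""
    d = len(points[0])
    return tuple(
        1 if 2 * sum(p[j] for p in points) > len(points) else 0
        for j in range(d)
    )


def two_median_cost(points, c1, c2):
    """Sum over points of the distance to the nearer of the two centers."""
    return sum(min(hamming(p, c1), hamming(p, c2)) for p in points)


def local_search_two_median(points, c1, c2, max_rounds=100):
    """Alternate between assigning points and recomputing majority centers."""
    for _ in range(max_rounds):
        g1 = [p for p in points if hamming(p, c1) <= hamming(p, c2)]
        g2 = [p for p in points if hamming(p, c1) > hamming(p, c2)]
        n1 = majority_center(g1) if g1 else c1
        n2 = majority_center(g2) if g2 else c2
        if (n1, n2) == (c1, c2):
            break
        c1, c2 = n1, n2
    return c1, c2
```

The heuristic only reaches a local optimum; it is meant to show what "minimize sum of distances" means for k = 2, not to match the guarantees of an approximation scheme.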
22 How to achieve linear limiting expected time?
Input space $\{0,1\}^{dn}$
Identify core
Use KSS
prob < O(dn)/KSS
Tail
23 How to achieve linear limiting expected time?
NP vs P input vicinity ? algorithmic vicinity
Store sample of precomputed KSS
nearest neighbor
Incremental algorithm
24 Main difficulty: How to spot the tail?
25 26 - 011010110110101010110010101010110100111001101010010100010
- 0110101101101010101100101010101001110010010010
1. Data is accessible before noise
2. Or it's not
2. Or ?
27 - 0110101101101010101100101010101001110010010010
1. Data is accessible before noise
28 - 011010110110101010110010101010110100111001101010010100010
decode
encode
29 - 011010110110101010110010101010110100111001101010010100010
Data inaccessible before noise
Assumptions are necessary!
30 - 011010110110101010110010101010110100111001101010010100010
Data inaccessible before noise
1. Sorted sequence
2. Bipartite graph, expander
3. Solid w/ angular constraints
4. Low dim attractor set
31 - 011010110110101010110010101010110100111001101010010100010
Data inaccessible before noise
Data must satisfy some property P but does not quite
32 f(x) ?
f = access function
[Diagram: query x, data, answer f(x)]
But life being what it is...
33 f(x) ?
[Diagram: query x, data, answer f(x)]
34 Humans
Define distance from any object to data class
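35 f(x) ?
[Diagram: query x goes to the filter; the filter queries the data at x1, x2, ... and receives f(x1), f(x2), ...; the filter answers g(x)]
g is access function for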
36 Similar to Self-Correction [RS96, BLR93], except:
- about data, not functions
- error-free
- allows O(distance to property)
37 Monotone function $[n]^d \to \mathbb{R}$
Filter requires polylog (n) queries
38 Offline reconstruction
39 Offline reconstruction
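As a point of comparison for the online setting, a minimal offline sketch for a one-dimensional sequence: when the whole input can be read first, keeping a longest non-decreasing subsequence and overwriting everything else changes the fewest possible values. The function names and the "copy the previous kept value" fill rule are assumptions made for this illustration.

```python
from bisect import bisect_right


def longest_nondecreasing_indices(values):
    """Indices of one longest non-decreasing subsequence (patience method)."""
    tails = []       # tails[k]: smallest last value of a subsequence of length k+1
    tails_idx = []   # index in values of that last element
    prev = [-1] * len(values)
    for i, v in enumerate(values):
        k = bisect_right(tails, v)
        if k == len(tails):
            tails.append(v)
            tails_idx.append(i)
        else:
            tails[k] = v
            tails_idx[k] = i
        prev[i] = tails_idx[k - 1] if k > 0 else -1
    out = []
    i = tails_idx[-1] if tails_idx else -1
    while i != -1:
        out.append(i)
        i = prev[i]
    return out[::-1]


def reconstruct_offline(values):
    """Return a monotone sequence that differs from values in as few places as possible."""
    kept = longest_nondecreasing_indices(values)
    if not kept:
        return []
    keep = set(kept)
    result = list(values)
    last = values[kept[0]]   # leading discarded entries copy the first kept value
    for i in range(len(values)):
        if i in keep:
            last = values[i]
        else:
            result[i] = last
    return result
```

For example, `reconstruct_offline([1, 2, 3, 100, 5, 6, 7, 0, 9, 10])` keeps eight entries and rewrites the two outliers. An online filter cannot do this, since it must answer each query before seeing the rest of the data.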
40 Online reconstruction
41 Online reconstruction
42 Online reconstruction
- don't mortgage the future
43 Online reconstruction
- early decisions are crucial!
44 Monotone function
46 Frequency of a point x
Smallest interval $I$ containing $> |I|/2$ violations involving $f(x)$
47 Frequency of a point
48 Given x:
1. estimate its frequency
2. if nonzero, find smallest interval around x with both endpoints having zero frequency
3. interpolate between f(endpoints)
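The three steps above can be sketched as follows for a one-dimensional array. This is a brute-force illustration, not the polylog-query filter from the talk: it scans the whole array instead of estimating by sampling. The function names are hypothetical. One labeled deviation: a point is flagged as soon as at least half (rather than strictly more than half) of some interval ending at it violates monotonicity with it, which guarantees that the unflagged points form a non-decreasing sequence.

```python
def is_bad(f, x):
    """True if some interval with endpoint x has >= half its points in violation with x."""
    n = len(f)
    violations = 0
    for j in range(x + 1, n):            # intervals [x, j]
        if f[j] < f[x]:
            violations += 1
        if 2 * violations >= j - x + 1:
            return True
    violations = 0
    for j in range(x - 1, -1, -1):       # intervals [j, x]
        if f[j] > f[x]:
            violations += 1
        if 2 * violations >= x - j + 1:
            return True
    return False


def filter_value(f, x):
    """g(x): f(x) if x looks clean, else interpolate between the nearest clean points."""
    if not is_bad(f, x):
        return f[x]
    left = next((i for i in range(x - 1, -1, -1) if not is_bad(f, i)), None)
    right = next((i for i in range(x + 1, len(f)) if not is_bad(f, i)), None)
    if left is None and right is None:
        return min(f)                    # no clean point anywhere: constant fallback
    if left is None:
        return f[right]
    if right is None:
        return f[left]
    t = (x - left) / (right - left)
    return f[left] + t * (f[right] - f[left])
```

Each answer depends only on f, never on earlier answers, so queries can arrive in any order and the results stay mutually consistent. On an input that is already monotone, no point is flagged and g agrees with f everywhere.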
49 To prove:
1. Frequencies can be estimated in polylog time
2. Function is monotone over zero-frequency domain
3. ZF domain occupies (1-2 ) fraction
50 Bivariate concave function
Filter requires polylog (n) queries
51 bipartite graph
k-connectivity
expander
52 denoising low-dim attractor sets
53 54 01001001110101011001000111o10010
Priced computation
accuracy
- spectrometry/cloning/gene chip
- PCR/hybridization/chromatography
- gel electrophoresis/blotting
55 experimentation
computation
56 Factoring is easy. Here's why
Gaussian mixture sample: 00100101001001101010101.
Pricing data
Ongoing project w/ Nir Ailon
57 Collaborators: Nir Ailon, Seshadri Comandur, Ding Liu