Estimating the Sortedness of a Data Stream

1
Estimating the Sortedness of a Data Stream
Parikshit Gopalan U T Austin
T. S. Jayram IBM Almaden
Robert Krauthgamer IBM Almaden
Ravi Kumar Yahoo! Research
2
Data Stream Model of Computation
Input: X1 X2 X3 … Xn
  • Computing with Massive data sets.
  • Sequential access.
  • Small storage space, update time.

Alon-Matias-Szegedy,
3
Sorting on Data-Streams
Cannot sort efficiently. Can we tell if the data needs to be sorted?
[Ergun-Kannan-Kumar-Rubinfeld-Vishwanathan, Ajtai-Jayram-Kumar-Sivakumar, Gupta-Zane, Cormode-Muthukrishnan-Sahinalp, Liben-Nowell-Vee-Zhu, Ailon-Chazelle-Commandur-Liu]
6
Sorting on Data-Streams
  • Cannot sort efficiently on a data-stream.
  • Can we tell if the data needs to be sorted?
    [Ergun-Kannan-Kumar-Rubinfeld-Vishwanathan, Ajtai-Jayram-Kumar-Sivakumar, Gupta-Zane, Cormode-Muthukrishnan-Sahinalp, Liben-Nowell-Vee-Zhu, Ailon-Chazelle-Commandur-Liu]
  • Measuring distance from sortedness:
    • Kendall Tau distance
    • Spearman Footrule distance
    • Ulam distance

7
Candidate metrics
1. Spearman's footrule: ℓ1 distance
σ = 3 5 7 9 10 4 1 2 6 8
e = 1 2 3 4 5 6 7 8 9 10 (identity)
Easy to compute.
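A minimal sketch of the footrule computation on the slide's example, assuming the permutation is given as a Python list of the values 1..n. The function name `spearman_footrule` is illustrative and not taken from the talk.

```python
def spearman_footrule(perm):
    """l1 distance between perm and the sorted order 1..n.

    Adds up how far each value sits from the position it would
    occupy in sorted order (positions are 1-based).
    """
    return sum(abs(value - position)
               for position, value in enumerate(perm, start=1))


sigma = [3, 5, 7, 9, 10, 4, 1, 2, 6, 8]
print(spearman_footrule(sigma))  # 38
```

A single left-to-right pass with one running sum is enough, which is why the slide calls this metric easy to compute.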
8
Candidate metrics
2. Kendall Tau distance: number of inversions.
Inversions: positions i < j where σ(i) > σ(j).
σ = 3 5 7 9 10 4 1 2 6 8
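The inversion count for the example permutation can be checked with a short brute-force sketch. It is quadratic time and needs the whole input in memory, so it only illustrates the definition and is not a streaming algorithm. The name `kendall_tau` is illustrative.

```python
def kendall_tau(perm):
    """Number of pairs of positions i < j with perm[i] > perm[j]."""
    n = len(perm)
    return sum(1
               for i in range(n)
               for j in range(i + 1, n)
               if perm[i] > perm[j])


sigma = [3, 5, 7, 9, 10, 4, 1, 2, 6, 8]
print(kendall_tau(sigma))  # 21
```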
10
Candidate metrics
2. Kendall Tau distance: number of inversions.
Within a factor of 2 of Spearman's footrule. [Diaconis-Graham]
An O(log n) space, 1-pass (1 + ε)-approximation algorithm. [Ajtai-Jayram-Kumar-Sivakumar]
11
Candidate metrics
3. Ulam distance (edit distance): Ed(σ) = number of deletions needed to sort σ.
[Ulam] Fastest way to sort a bridge hand.
12
Edit Distance and the LIS
Ed(σ) = number of deletions needed to sort σ.
5 7 8 1 10 4 2 3 6 9
13
Edit Distance and the LIS
Ed(σ) = number of deletions needed to sort σ.
5 7 8 1 10 4 2 3 6 9
Delete
5 7 8 10
Insert
1 2 3 4 5 6 7 8 9 10
14
Edit Distance and the LIS
Ed(σ) = number of deletions needed to sort σ.
LIS(σ) = length of the longest increasing subsequence.
Ed(σ) + LIS(σ) = n
  • Studied in statistics, biology, computer science
  • Both take a global view of the sequence.
  • Hard for models like streaming, sketching,
    property-testing.
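The identity Ed(σ) + LIS(σ) = n can be checked on the running example with a small offline sketch. It uses the textbook quadratic dynamic program for the LIS, chosen here only for clarity, so it is not a streaming method. The function names are illustrative.

```python
def lis_length(seq):
    """Length of the longest strictly increasing subsequence (O(n^2) DP)."""
    best = []  # best[i] = length of the longest increasing subsequence ending at i
    for i, x in enumerate(seq):
        longest_before = max((best[j] for j in range(i) if seq[j] < x),
                             default=0)
        best.append(longest_before + 1)
    return max(best, default=0)


def deletions_to_sort(seq):
    """Ed(seq): everything outside one longest increasing subsequence is deleted."""
    return len(seq) - lis_length(seq)


sigma = [5, 7, 8, 1, 10, 4, 2, 3, 6, 9]
print(lis_length(sigma), deletions_to_sort(sigma))  # 5 5
```

Both quantities depend on the sequence as a whole, which is the "global view" that makes them awkward for streaming, sketching, and property testing.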

15
Prior Work
  • Exact computation of Ed(σ) and LIS(σ)
    • Patience Sorting [Ross, Mallows]

16
Patience Sorting
Input: 5 7 8 1 10 4 2 3 6 9
Piles after reading 5 7 8 (top first): [5] [7] [8]
17
Patience Sorting
Input: 5 7 8 1 10 4 2 3 6 9
Piles after reading 5 7 8 1 10 (top first): [1 5] [7] [8] [10]
18
Patience Sorting
Input: 5 7 8 1 10 4 2 3 6 9
Piles after reading 5 7 8 1 10 4 (top first): [1 5] [4 7] [8] [10]
19
Patience Sorting
Input: 5 7 8 1 10 4 2 3 6 9
Piles after reading 5 7 8 1 10 4 2 (top first): [1 5] [2 4 7] [8] [10]
Number in place i: earliest end of an increasing subsequence (IS) of length i.
20
Patience Sorting
Input: 5 7 8 1 10 4 2 3 6 9
Piles after reading 5 7 8 1 10 4 2 3 (top first): [1 5] [2 4 7] [3 8] [10]
Number in place i: earliest end of an increasing subsequence (IS) of length i.
21
Patience Sorting
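Input: 5 7 8 1 10 4 2 3 6 9 0
Piles after reading 5 7 8 1 10 4 2 3 6 9 (top first): [1 5] [2 4 7] [3 8] [6 10] [9]
Number in place i: earliest end of an increasing subsequence (IS) of length i.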
22
Patience Sorting
Input: 5 7 8 1 10 4 2 3 6 9 0
Piles after reading the final 0 (top first): [0 1 5] [2 4 7] [3 8] [6 10] [9]
Number in place i: earliest end of an increasing subsequence (IS) of length i.
23
Patience Sorting
Input: 5 7 8 1 10 4 2 3 6 9 0
Piles (top first): [0 1 5] [2 4 7] [3 8] [6 10] [9]
Number of piles = length of LIS.
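The pile trace above can be reproduced with a short one-pass sketch that keeps only the top of each pile, which is all that is needed to read off the LIS length. It assumes distinct values, as in a permutation. The name `patience_tops` is illustrative.

```python
from bisect import bisect_left


def patience_tops(stream):
    """One pass of patience sorting, tracking only the top card of each pile.

    tops[i] is the smallest value that ends an increasing subsequence of
    length i + 1 seen so far, so len(tops) is the LIS length.
    """
    tops = []
    for x in stream:
        pile = bisect_left(tops, x)  # leftmost pile whose top is >= x
        if pile == len(tops):
            tops.append(x)           # x extends the longest subsequence: new pile
        else:
            tops[pile] = x           # x becomes the new, smaller top of that pile
    return tops


print(patience_tops([5, 7, 8, 1, 10, 4, 2, 3, 6, 9]))     # [1, 2, 3, 6, 9]
print(patience_tops([5, 7, 8, 1, 10, 4, 2, 3, 6, 9, 0]))  # [0, 2, 3, 6, 9]
```

The memory used is one value per pile, so it grows with the LIS length and can reach n on a sorted input.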
24
Prior Work
  • Exact computation of Ed(σ) and LIS(σ)
    • Patience Sorting [Ross, Mallows]
    • O(n) space, 1-pass streaming algorithm.
    • Ω(√n) space lower bound. [Liben-Nowell-Vee-Zhu]
  • Approximating Ed(σ) and LIS(σ)
    • No sub-linear space algorithms, no lower bounds.
      [Ajtai et al., Cormode et al., Liben-Nowell et al.]
  • LIS algorithms parametrized by length of LIS
    [Liben-Nowell-Vee-Zhu, Sun-Woodruff]
  • Computing Ed(σ) in other models
    • Property testing [Ergun et al., Ailon et al.]
    • Sketching [Charikar-Krauthgamer]

25
Our Results
  • Approximating Ed(σ)
    • An O(log² n) space, randomized 4-approximation for Ed(σ).
    • An O(√n) space, deterministic (1 + ε)-approximation for Ed(σ).
  • Approximating the LIS
    • An O(√n) space, deterministic (1 + ε)-approximation for LIS(σ).
  • Exact computation of Ed(σ) and LIS(σ)
    • An Ω(n) space lower bound for randomized algorithms.
    • Independently proved by Sun-Woodruff.
  • Lower bounds for approximating the LIS
    • Conjecture: deterministic algorithms require Ω(√n) space for a (1 + ε)-approximation.

26
Computing the Edit Distance
Thm: For any ε > 0, there is a one-pass randomized algorithm using O(ε⁻² log² n) space and update time that gives a (4 + ε)-approximation to Ed(σ).
  1. Combinatorial measure that approximates Ulam distance. Builds on [Ergun et al., Ailon et al.].
  2. Sampling scheme to compute this measure in one pass.

27
A Voting Scheme [Ergun et al.]
  • Combinatorial measure called Unpopularity.
  • Neighborhoods of σ(i): intervals starting or ending at i.

28
A Voting Scheme [Ergun et al.]
  • Combinatorial measure called Unpopularity.
  • Neighborhoods of σ(i): intervals starting or ending at i.

Deciding if σ(i) is unpopular:
  • For every neighborhood of σ(i), every number in the neighborhood votes on: is σ(i) out of order?
  • If a majority in some neighborhood votes against σ(i), it is marked unpopular.

Let U(σ) denote the number of unpopular numbers.
[Ergun et al.] Ed(σ) ≤ U(σ)
[Ailon et al.] U(σ) ≤ 2 Ed(σ)
29
A Voting Scheme [Ergun et al.]
  • Can we estimate U(σ) using a streaming algorithm?

4 5 3 7 1 2
30
A Voting Scheme [Ergun et al.]
  • Can we estimate U(σ) using a streaming algorithm?

4 5 3 7 1 2
Impossible to decide if σ(i) is unpopular before seeing the entire input.
31
A New Voting Scheme
  • Neighborhoods of σ(i): intervals ending at i.
  • If a majority in some neighborhood votes against σ(i), it is marked unpopular.
  • Unpopularity based only on the past, not the future.

Thm: Let V(σ) denote the number of unpopular numbers. Then Ed(σ)/2 ≤ V(σ) ≤ 2 Ed(σ).
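A brute-force sketch of the past-only voting rule, useful for checking the theorem on small inputs. The slide does not spell out the details, so the following are assumptions: a neighborhood of σ(i) is a block of consecutive earlier positions j..i-1, a number votes against σ(i) when it is larger than σ(i), and "majority" means strictly more than half. The name `unpopular_positions` is illustrative.

```python
def unpopular_positions(seq):
    """0-based positions marked unpopular by the past-only voting rule.

    Assumption: position i is unpopular if, for some j < i, strictly more
    than half of seq[j:i] is larger than seq[i].
    """
    marked = []
    for i, x in enumerate(seq):
        larger = 0  # values greater than x found so far in seq[j:i]
        for j in range(i - 1, -1, -1):  # grow the interval leftwards
            if seq[j] > x:
                larger += 1
            if 2 * larger > i - j:      # strict majority of the i - j voters
                marked.append(i)
                break
    return marked


print(unpopular_positions([4, 5, 3, 7, 1, 2]))  # [2, 4, 5]
```

On 4 5 3 7 1 2 the rule marks 3, 1 and 2, so V = 3. Ed = 3 there as well (the increasing subsequence 4 5 7 has length 3), which is consistent with Ed(σ)/2 ≤ V(σ) ≤ 2 Ed(σ). This version takes quadratic time and stores the whole input, so it only checks the definition.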
32
A Voting Scheme
  • Let Ed(σ) = k. Then V(σ) ≤ 2k.
  • Fix an optimal Bad set of size k to delete.

How many numbers can be Unpopular?
Partition Unpopular into Good and Bad.
Good numbers form an increasing sequence.
Good never votes against Good.
Good and Unpopular ⇒ Bad neighborhood!
33
A Voting Scheme
  • Let Ed(σ) = k. Then V(σ) ≤ 2k.
  • Fix an optimal Bad set of size k to delete.

Good and Unpopular ⇒ Bad neighborhood!
If k numbers are Bad, at most k are Good and Unpopular.
Bad numbers might all be Unpopular.
Hence V(σ) ≤ 2k.
34
A Voting Scheme
  • Let Ed(σ) = k. Then V(σ) ≤ 2k.
  • Bound can be tight.

100 99 98 … 91 1 2 3 … 10 11 12 … 90
35
A Voting Scheme
  • Let V(σ) = k. Then Ed(σ) ≤ 2k.
  • Fix the set of k Unpopular elements.
  • Algorithm to produce an increasing sequence:
    • Scan right to left.
    • Delete Unpopular elements and inversions w.r.t. the last number in the sequence.
    • At least half of the deletions are Unpopular numbers.
  • What remains is an increasing sequence.
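The right-to-left scan can be sketched as follows. This is one reading of the slide's outline. It assumes distinct values and takes the set of Unpopular positions as an input computed elsewhere by the past-only voting rule. "Last number in sequence" is read as the most recently kept number. The function name is illustrative.

```python
import math


def delete_right_to_left(seq, unpopular):
    """Scan from the right, deleting Unpopular positions and any number
    larger than the most recently kept one.

    `unpopular` is a set of 0-based positions, assumed to come from the
    voting rule. Returns (kept numbers in original order, deletion count).
    """
    kept = []
    last_kept = math.inf  # sentinel: nothing has been kept yet
    deleted = 0
    for i in range(len(seq) - 1, -1, -1):
        if i in unpopular or seq[i] > last_kept:
            deleted += 1
        else:
            kept.append(seq[i])
            last_kept = seq[i]
    kept.reverse()
    return kept, deleted


# Positions 2, 4, 5 (the numbers 3, 1, 2) are the Unpopular ones here.
print(delete_right_to_left([4, 5, 3, 7, 1, 2], {2, 4, 5}))  # ([4, 5, 7], 3)
```

Every kept number is smaller than the one kept after it, so the output is increasing by construction. The slide's counting argument is what bounds the deletions by 2k.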

36
A Voting Scheme
  • Let V(σ) = k. Then Ed(σ) ≤ 2k.
  • Bound can be tight.

11 … 50 91 92 93 … 100 1 2 3 … 10 51 … 90
37
A New Voting Scheme
  • Neighborhoods of σ(i): intervals ending at i.
  • If a majority in some neighborhood votes against σ(i), it is marked unpopular.
  • Unpopularity based only on the past, not the future.

Thm: Let V(σ) denote the number of unpopular numbers. Then Ed(σ)/2 ≤ V(σ) ≤ 2 Ed(σ).
Can we estimate V(σ) efficiently?
38
Outline of Sampling Scheme
  • Taking a vote in one neighborhood:
    • Take O(log n) samples, take the (approx) majority.
    • Reservoir Sampling [Vitter].

3 7 8 6 5 9 1 2
Computing V(σ): need O(log n) samples from every neighborhood.
39
Outline of Sampling Scheme
Computing V(σ): need O(log n) samples from every neighborhood.
Key observation: don't need samples across intervals to be independent! Roughly O(log² n) samples suffice.
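For reference, a minimal sketch of the reservoir-sampling building block (the classic "Algorithm R"), which keeps a uniform sample of k items from a stream of unknown length in one pass. It does not reproduce the talk's scheme for sharing samples across neighborhoods. The name `reservoir_sample` and the `rng` parameter are illustrative.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Uniform random sample of up to k items from a one-pass stream."""
    rng = rng or random.Random()
    reservoir = []
    for count, item in enumerate(stream, start=1):
        if count <= k:
            reservoir.append(item)        # fill the reservoir first
        else:
            slot = rng.randrange(count)   # uniform in 0 .. count - 1
            if slot < k:                  # happens with probability k / count
                reservoir[slot] = item
    return reservoir


sample = reservoir_sample([3, 7, 8, 6, 5, 9, 1, 2], k=3, rng=random.Random(0))
print(len(sample))  # 3
```

To vote in one neighborhood, one would count how many sampled values exceed σ(i) and take the approximate majority, as the outline describes.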
40
Deterministic Algorithm for LIS
Thm: For any ε > 0, there is a one-pass deterministic algorithm using O(√(n/ε)) space and update time that gives a (1 - ε)-approximation to LIS(σ).
Based on a multiplayer communication protocol for LIS.
[Figure: the input divided into blocks among the players]
  • Algorithm simulates protocol for √n players.

41
Two-Player Protocol for LIS
[Figure: two-player protocol. Recoverable labels: n/2, Patience Sorting, k, multiples of εk, 1/ε.]
42
Approximating the LIS
Consider a k-player communication protocol for LIS.
[Figure: the input divided into blocks among the k players]
  • As k increases, maximum message size increases.

Conjecture: For some ε0 > 0, every 1-pass deterministic algorithm that gives a (1 + ε0)-approximation to LIS(σ) requires Ω(√n) space.
Proving the conjecture requires analyzing k = √n.
43
Lower Bounds for approximating the LIS
Conjecture: For some ε0 > 0, every 1-pass deterministic algorithm that gives a (1 + ε0)-approximation to LIS(σ) requires Ω(√n) space.
Candidate Hard Instances?
1.8 2.9 3.7 4.9
1.6 2.8 3.5 4.6
1.3 2.5 3.3 4.5
1 2 3.2 4.2
44
Lower Bounds for approximating the LIS
Conjecture: For some ε0 > 0, every 1-pass deterministic algorithm that gives a (1 + ε0)-approximation to LIS(σ) requires Ω(√n) space.
Candidate Hard Instances?
Yes
No
1.8 2.9 3.7 4.9
1.6 2.8 3.5 4.6
1.3 2.5 3.3 4.5
1 2 3.2 4.2
1.7 2.8 3.4 4.8
1.6 2.6 3.5 4.6
1.3 2.5 3.6 4.5
1.1 2.1 3.9 4.2
47
Open Problems
  • Estimate the edit distance between two permutations.
  • Tight bounds for approximation:
    • Show Ω(√n) lower bound for deterministic algorithms.
    • Randomized algorithm for LIS?

Thank You!