Population Stratification with Limited Data - PowerPoint PPT Presentation

About This Presentation

Title:

Population Stratification with Limited Data

Description:

Population Stratification with Limited Data By Kamalika Chaudhuri, Eran Halperin, Satish Rao and Shuheng Zhou – PowerPoint PPT presentation

Number of Views:43

Avg rating:3.0/5.0

Slides: 22

Provided by: kamalika

Learn more at: https://cseweb.ucsd.edu

Category:

more less

Transcript and Presenter's Notes

Title: Population Stratification with Limited Data

1
Population Stratification with Limited Data

By
Kamalika Chaudhuri, Eran Halperin, Satish Rao and
Shuheng Zhou

2
The Problem

Given
Samples from two hidden distributions P1 and P2
Unknown labels
Each sample/individual
k features 0/1 values
Population P1 feature f is 1 w.p. p1f
Population P2 feature f is 1 w.p. p2f
Unknown feature probabilities

3
The Problem

Given
2n samples from two hidden distributions P1 and
P2
Unknown labels

Goal Classify each individual correctly for most
inputs

4
Applications

Preprocessing step in statistical analysis
Analyze the factors that cause a complex disease,
such as cancer
Cluster the samples into populations, then apply
statistical analysis
Collaborative Filtering
Feature can be likes Star Wars or not
Cluster users into types using the features

5
The Problem

Given
Samples from two hidden distributions P1 and P2
Unknown labels

Need Some Separation Between the Distributions
6
Our Results

Need some separation between the distributions!
Measure of Separation distance between means
? L1 distance between means / k
? L22 distance between means / k
Our Results
Optimization function and poly-time algorithm
? k W(vk log n)
Optimization function ? k W( log n)

7
Our Results

This talk
Optimization function and poly-time algorithm
? k ?(vk log n)
Example
P1 For each feature f, p1f ½
P2 For each feature f, p2f ½ vlog n/vk
Information-theoretically optimal
There exists two distributions with this
separation and constant overlap in probability
mass

8
Optimization Function

What measure to optimize to get the correct
clustering?

Need a robust measure which works for small
separations

9
A Robust Measure

Find the best balanced partition (S,S) such
that
?f Nf(S) Nf(S)
is maximum
Nf(S), Nf(S) of individuals with feature f
in S, S

10
A Robust Measure

Find the best balanced partition (S,S) such
that
?f Nf(S) Nf(S)
is maximum
Nf(S), Nf(S) of individuals with feature f
in S, S

Theorem Optimizing this measure provides the
correct partition w.h.p. if
? k ?(vk log n)
11
Proof Sketch

How does the optimal partition behave?

E f(P) ? k n k vn Pr f(P) Ef gtnvk
2-n
E f(Any partition) k vn Pr f(P) Ef gt
nvk 2-n
The partition with the optimal value of f in (I)
dominates all the partitions in (II) w.h.p for
the separation conditions
12
An Algorithm

How can we find the partition which optimizes
this measure?

Theorem There exists an algorithm which finds
the correct partition when ? k ?(vk
log2n) Running Time O(nk log2 n)
13
An Algorithm

Algorithm
Divide individuals into two sets A and B
Start with a random partition of A
Iterate log n times
Classify B using current partition of A and a
proximity score
And the same for A

14
An Algorithm

Iterate
Classify B using current partition of A and a
score
And vice versa.
Random Partition
( 1/2 1/vn) imbalance
Each iteration produces a partition with more
imbalance

15
Classification Score

Our Score For each feature f,
If Nf(S) gt Nf(S)
add 1 to the score if f is present,
else subtract 1
If Nf(S) lt Nf(S)
add 1 to the score if f is absent,
else subtract 1
Classify
Individuals above the median score S
Individuals below the median score S

16
Classification

Lemma If the current partition has (1/2
?)-imbalance, the next iteration produces a
partition with (1/2 2?)-imbalance for ? lt c
Lemma If the current partition has (1/2
c)-imbalance, the next iteration produces the
correct partition with our separation conditions.
?(log n) rounds needed to get the correct
partition
Use a fresh set of features in each round to get
independence

17
Proof Sketch