Bodhisattva Sen - PowerPoint PPT Presentation



Transcript and Presenter's Notes

Title: Bodhisattva Sen


1
  • Bodhisattva Sen

2
Topics
  • Fractile Graphical Analysis (FGA): definition,
    motivation, and Prof. Mahalanobis' method.
    Statistical inference using FGA: our ideas.
  • Two notions of multivariate quantiles:
    Geometric Quantiles and PCM's Fractile.
  • Extension of FGA to the multiple covariate setup
    using the above two notions of quantiles.
  • Evaluation of the performance of the methods
    using synthetic data and real-life data.

3
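We want to compare the regression functions for two
populations with a single covariate.
  • We could look at $\mu_1(x) = E(Y_1 \mid X_1 = x)$
    and $\mu_2(x) = E(Y_2 \mid X_2 = x)$.
  • Instead we look at $m_1(t) = E(Y_1 \mid F_1(X_1) = t)$
    and $m_2(t) = E(Y_2 \mid F_2(X_2) = t)$,

where $F_1$, $F_2$ are the distribution functions of $X_1$
and $X_2$.
When $X_1$ and $X_2$ are not on comparable scales, the
above does the necessary standardization.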
4
Mahalanobis' Idea for Comparing Fractile
Graphs (Econometrica, 1960)
  • Divide the dataset into fractile groups
    according to the x-variable. Plot the y-averages
    for each fractile group and join the consecutive
    points (y-averages) by straight lines.
  • Compute the area between the two fractile graphs
    for the two different samples. This is called the
    separation area.
  • The significance of the observed separation area
    is tested using two sub-samples from the same
    population.
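
The two steps above (group averages, then the area between the graphs) can be sketched as follows. This is a minimal illustration, not Mahalanobis' original procedure: the function names, the choice of 10 fractile groups, and the trapezoidal approximation of the area are assumptions made here.

```python
import numpy as np

def fractile_graph(x, y, n_groups=10):
    """Return fractile-group midpoints and the y-average within each group."""
    order = np.argsort(x)                          # rank observations by x
    y_sorted = np.asarray(y, dtype=float)[order]
    groups = np.array_split(y_sorted, n_groups)    # (nearly) equal-size groups
    means = np.array([g.mean() for g in groups])
    mids = (np.arange(n_groups) + 0.5) / n_groups  # group midpoints on (0, 1)
    return mids, means

def separation_area(means_a, means_b, mids):
    """Approximate the area between two fractile graphs (trapezoidal rule
    applied to the absolute gap at the group midpoints)."""
    gap = np.abs(np.asarray(means_a) - np.asarray(means_b))
    return float(np.sum(0.5 * (gap[1:] + gap[:-1]) * np.diff(mids)))
```

The trapezoidal rule on the absolute gap is exact when the two graphs do not cross between neighbouring midpoints, and only an approximation otherwise.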

5
Limitations of Mahalanobis' Idea
  • The exact distribution of the test statistic
    (the separation area) is not known. Only some
    approximations were tried by Mahalanobis and his
    co-workers.
  • He divided the original dataset into two parts
    and used the area between the two sub-samples
    (called the error area) to test the significance
    of the observed separation area. The error area
    calculated in this way has a different variance
    from the separation area because each sub-sample
    has fewer sample points.
  • The accuracy of the approximation was poor, and
    this was known to Mahalanobis.

6
Statistical Inference in FGA: Our Ideas
  • We discuss two methods to test the equality of
    the regression functions (single covariate):
  • Swap the Response (Method I).
  • Resampling from the Joint Density of (X,Y)
    (Method II).

7
Swap the Response
  • Transform the covariates into the corresponding
    quantiles, i.e., the i-th ranked x-value is
    transformed to i/(n+1) (where n is the number of
    data points).
  • Draw the fractile graphs by smoothing the
    y-values using the usual kernel regression
    estimates and compute the separation area
    between them.
  • To test the significance of the observed
    separation area we resample from the two
    samples in the following way:

8
We form two resampled datasets and compute the
separation area between them.
Case (I): Suppose n1 = n2 = n (i.e., both samples are
of the same size). For the i-th data point (i.e., with
quantile value i/(n+1)) we interchange, or swap,
the y-values of the two datasets with probability
0.5 and keep the original y-values with
probability 0.5.
Case (II): The sample sizes are different. We
interpolate the dataset from the 2nd population to
find the y-value at the i/(n1+1)-th quantile and
then repeat the same idea as above. Similarly we
get the other resampled dataset. (Modified Swap
Method)
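
A rough sketch of Case (I), the equal-sample-size swap, is given below. It is an illustration under stated assumptions rather than the authors' code: the Gaussian kernel, the bandwidth `h`, the evaluation grid on [0.05, 0.95], the number of resamples, and all function names are choices made here.

```python
import numpy as np

def nw_smooth(t, y, grid, h):
    """Nadaraya-Watson estimate with a Gaussian kernel, evaluated on `grid`."""
    w = np.exp(-0.5 * ((grid[:, None] - t[None, :]) / h) ** 2)
    return (w @ y) / w.sum(axis=1)

def area_between(y1, y2, h, grid):
    """Separation area between the smoothed fractile graphs of two equal-size
    samples whose y-values are already ordered by their x-values."""
    n = len(y1)
    t = np.arange(1, n + 1) / (n + 1)            # i-th ranked x maps to i/(n+1)
    gap = np.abs(nw_smooth(t, y1, grid, h) - nw_smooth(t, y2, grid, h))
    return float(np.sum(0.5 * (gap[1:] + gap[:-1]) * np.diff(grid)))

def swap_test(x1, y1, x2, y2, h=0.1, n_resamples=200, seed=0):
    """Return the observed separation area and a resampling P-value."""
    rng = np.random.default_rng(seed)
    y1s = np.asarray(y1, dtype=float)[np.argsort(x1)]
    y2s = np.asarray(y2, dtype=float)[np.argsort(x2)]
    grid = np.linspace(0.05, 0.95, 91)
    observed = area_between(y1s, y2s, h, grid)
    null = np.empty(n_resamples)
    for b in range(n_resamples):
        swap = rng.random(len(y1s)) < 0.5        # one fair coin per rank i
        r1 = np.where(swap, y2s, y1s)            # swapped or original y-values
        r2 = np.where(swap, y1s, y2s)
        null[b] = area_between(r1, r2, h, grid)
    return observed, float(np.mean(null >= observed))
```

Case (II) would additionally need an interpolation step for the second sample before the swap, which is not shown here.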
9
Resample from the Joint Density of (X,Y)
  • We use the usual kernel density estimates with a
    very small bandwidth (proportional to the
    inverse of the sample size) to generate a pair of
    resampled datasets from each population. It is
    like resampling from a smoothed version of the
    empirical distribution of (X,Y).
  • We compute the separation area for each pair
    of resampled datasets from the same population.
    This gives us the distribution of the separation
    area under H0. (Note that we have 2 separation
    areas corresponding to the 2 populations.)
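
Drawing from a Gaussian kernel density estimate amounts to picking an observed point at random and adding a small amount of kernel noise, so the resampling step can be sketched as below. The Gaussian kernel, the constant `c`, the scaling of the bandwidth by each coordinate's standard deviation, and the helper names are assumptions for illustration; `area_fn` stands for whatever separation-area statistic is in use.

```python
import numpy as np

def smoothed_resample(x, y, rng, c=1.0):
    """Draw n points from a Gaussian-kernel density estimate of (X, Y) with
    a small bandwidth h = c / n (scaled by each coordinate's std)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    idx = rng.integers(0, n, size=n)             # pick kernel centres uniformly
    h = c / n                                    # bandwidth ~ 1 / sample size
    xs = x[idx] + h * x.std() * rng.standard_normal(n)
    ys = y[idx] + h * y.std() * rng.standard_normal(n)
    return xs, ys

def null_areas(x, y, area_fn, n_resamples=200, seed=0):
    """Separation areas between pairs of datasets resampled from ONE
    population, used as a reference distribution under H0."""
    rng = np.random.default_rng(seed)
    out = np.empty(n_resamples)
    for b in range(n_resamples):
        xa, ya = smoothed_resample(x, y, rng)
        xb, yb = smoothed_resample(x, y, rng)
        out[b] = area_fn(xa, ya, xb, yb)
    return out
```

Running `null_areas` once per population gives the two reference distributions, and hence two P-values, mentioned in the note above.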

10
Plots of the datasets along with the Fractile
Graphs. The models are Y = 1.0 + X + e and
Y = 1.2 + X + e, where e ~ N(0, 0.09), X ~ N(0, 1),
and the number of data points is N = 100.
11
Swap the Response Method: an illustration
12
Resample from the Joint Distribution of (X,Y): an
illustration
13
How do we define Quantiles in the Multivariate
setup?
We discuss two such notions of
Multivariate Quantiles:
  • 1. Geometric Quantiles and 2. PCM's Fractile

14
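Geometric Quantiles
Chaudhuri (JASA 1996), Koltchinskii (Annals of
Stat. 1997)
  • If $X \sim P$ ($X \in \mathbb{R}^d$), then we define the
    Geometric Quantile $Q_P(u)$ for
    $u \in B^{(d)} = \{u : \|u\| < 1\}$ as

$$Q_P(u) = \arg\min_{q \in \mathbb{R}^d} E\{\Phi(u, X - q) - \Phi(u, X)\},$$
where $\Phi(u, t) = \|t\| + \langle u, t \rangle$,
the norm is the usual Euclidean norm,
and $\langle \cdot , \cdot \rangle$ denotes the usual inner product.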
15
Properties of Geometric Quantiles
  • The solution $Q_P(u)$ always exists for any u, and
    it is unique if P is not supported on a straight
    line.
  • $Q_P(u)$ characterizes the associated distribution,
    i.e., $Q_{P_1}(u) = Q_{P_2}(u)$ for all u implies
    $P_1 = P_2$.
  • Computation of the sample geometric quantile for
    the data set $X_1, X_2, \ldots, X_n$, via

$$\hat{Q}_n(u) = \arg\min_{q \in \mathbb{R}^d} \sum_{i=1}^{n} \{\|X_i - q\| + \langle u, X_i - q \rangle\},$$
is straightforward.
  • The Geometric Quantiles have an asymptotic
    multivariate normal distribution and the
    convergence rate is $n^{1/2}$.
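
As an illustration of the computation, the sample geometric quantile minimizes the convex criterion $\sum_i \{\|X_i - q\| + \langle u, X_i - q \rangle\}$, whose first-order condition is $\sum_i (X_i - q)/\|X_i - q\| + n u = 0$. The sketch below solves it with a Weiszfeld-type fixed-point iteration. That algorithm is a choice made here, not necessarily the one used by the authors, and the function name and tolerances are assumptions.

```python
import numpy as np

def geometric_quantile(X, u, max_iter=1000, tol=1e-12, eps=1e-12):
    """Sample geometric quantile of the rows of X for a vector u with ||u|| < 1,
    computed by a Weiszfeld-type fixed-point iteration."""
    X = np.asarray(X, dtype=float)
    u = np.asarray(u, dtype=float)
    if np.linalg.norm(u) >= 1:
        raise ValueError("u must lie in the open unit ball")
    n = len(X)
    q = X.mean(axis=0)                           # starting value
    for _ in range(max_iter):
        dist = np.maximum(np.linalg.norm(X - q, axis=1), eps)
        w = 1.0 / dist                           # inverse-distance weights
        q_new = (w @ X + n * u) / w.sum()        # fixed point of the FOC
        if np.linalg.norm(q_new - q) < tol:
            return q_new
        q = q_new
    return q
```

With `u` equal to the zero vector this reduces to the usual Weiszfeld iteration for the spatial median.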

16
Drawbacks of Geometric Quantile
  • No simple distributional interpretation, as was
    the case with the univariate quantiles.
  • Though the Geometric Quantile is equivariant with
    respect to shifts, orthogonal transformations,
    and homogeneous scale transformations, it is
    not equivariant under heterogeneous scale
    transformations.
  • Affine equivariant versions of Geometric
    Quantiles are defined using
    Transformation-Retransformation methods (see
    "On Affine Equivariant Multivariate Quantiles",
    Biman Chakraborty, Ann. Inst. Stat. Math.,
    Vol. 53, No. 2, 380-403 (2001)). However, we do
    not intend to consider them here.

17
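Another notion of Multivariate Quantile: PCM's
Fractile (Mahalanobis, 1970)
For a d-dimensional random vector
$X = (X_1, X_2, \ldots, X_d) \sim P$, we define the PCM's
distribution function $H_P$ from $\mathbb{R}^d$ to $[0,1]^d$ as
$$H_P(x) = \big(F_1(x_1),\ F_{2|1}(x_2 \mid x_1),\ \ldots,\ F_{d|1,\ldots,d-1}(x_d \mid x_1, \ldots, x_{d-1})\big),$$
where $F_{i|1,\ldots,i-1}$ is the conditional distribution
function of $X_i$ given $X_1, \ldots, X_{i-1}$.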
18
Why should we consider PCM's fractile?
  • If P and Q are two distributions with continuous
    densities on $\mathbb{R}^d$ and $H_P(x) = H_Q(x)$ for all
    x in $\mathbb{R}^d$, then P = Q.
  • Define $Z = (Z_1, Z_2, \ldots, Z_d) \sim Q$ such that
    $Z_i = f_i(X_i)$ for all $i = 1, \ldots, d$, where each
    $f_i$ is a strictly monotone function on $\mathbb{R}$.
    Then $H_P(x) = H_Q(f(x))$, where
    $f(x) = (f_1(x_1), f_2(x_2), \ldots, f_d(x_d))$. This
    shows that the PCM's fractile is equivariant
    under co-ordinate-wise monotonic transformations.
  • The PCM's fractile has simple probabilistic
    interpretations.

19
Drawbacks of PCM's Fractile
  • Requires the computation of the conditional
    distributions, which are estimated by kernel
    density estimation. This estimation is very
    unstable in high dimensions due to the lack of
    data points. It is also computationally very
    intensive and time consuming.
  • The PCM's Fractile map is not $n^{1/2}$-consistent,
    as it uses density estimates which are
    $n^{v}$-consistent ($v < 1/2$). v may be much smaller
    than 1/2 when the covariate dimension is high
    (Curse of Dimensionality in Non-Parametric
    Function Estimation).
  • PCM's Fractile depends on the ordering of the
    co-ordinate random variables: we have to fix an
    order and then work with it.
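
To make the computational burden concrete, here is a sketch of an empirical fractile map for d = 2, assuming the map consists of the marginal distribution function of X1 followed by the conditional distribution function of X2 given X1. The conditional part is estimated by a kernel-weighted empirical distribution function, which is a simplification chosen for this example rather than the density-based estimate described above. The function name and the bandwidth `h` are hypothetical.

```python
import numpy as np

def pcm_fractile_2d(X, points, h=0.3):
    """Map each row (a, b) of `points` to (F1(a), F_{2|1}(b | a)), estimated
    from the sample X (n x 2) with Gaussian kernel weights in the x1-direction."""
    X = np.asarray(X, dtype=float)
    points = np.asarray(points, dtype=float)
    out = np.empty_like(points)
    for k, (a, b) in enumerate(points):
        out[k, 0] = np.mean(X[:, 0] <= a)                    # marginal F1(a)
        w = np.exp(-0.5 * ((a - X[:, 0]) / h) ** 2)          # weights near x1 = a
        out[k, 1] = np.sum(w * (X[:, 1] <= b)) / np.sum(w)   # F_{2|1}(b | a)
    return out
```

Swapping the two columns of `X` gives a different map, which is the order dependence noted in the last bullet. The sketch is meant for evaluation points near the data (for example the sample itself), since all kernel weights can underflow to zero far from every observation.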

20
Extension of Fractile Graphical Analysis to the
Multiple Covariate Case
  • We use two different notions of Multivariate
    Quantiles (Geometric Quantiles and PCM's
    Fractile).
  • We extend the method of Resampling from the
    joint density of (X,Y) using both the above
    notions of Multivariate Quantiles.

21
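Methodology using PCM's Fractile
  • Transform the datasets to
    $\{(H_1(X_{1i}), Y_{1i}) : i = 1, \ldots, n_1\}$
    and
    $\{(H_2(X_{2j}), Y_{2j}) : j = 1, \ldots, n_2\}$,
    where $H_k$ is the PCM's distribution function
    of the covariates in population k.
  • Smooth the two transformed datasets by using the
    usual multivariate kernel regression estimates to
    get the two fractile surfaces.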

22
3. The difference between the two fractile
surfaces gives the separation volume between the
two populations.
4. To find the significance level for the observed
volume we resample a pair of datasets from the
same population using a very small bandwidth (from
a fitted joint density of X and Y) and recalculate
the volume between the two surfaces for the same
population. In this way we try to estimate the
distribution of the separation volume under H0.
23
Remarks
  • We used Least Squares Cross Validation (LSCV)
    methods in choosing the bandwidth parameters in
    the kernel regression and kernel density
    estimates (note that we use LSCV optimal
    bandwidths for the KDE in the transformation
    $x \mapsto H(x)$).
  • The method described depends highly on the
    fitted joint distribution of the covariate
    (i.e., X), as we are computing the Quantiles
    from this kernel density estimate.
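
For the kernel regression part, cross-validated bandwidth choice can be sketched as a leave-one-out search over a candidate grid. This is a simplified stand-in for the LSCV procedure mentioned above: the Gaussian kernel, the single covariate, the candidate list, and the function names are all assumptions of this example, and the density-estimation version is not shown.

```python
import numpy as np

def loo_cv_score(t, y, h):
    """Leave-one-out mean squared prediction error of a Gaussian
    Nadaraya-Watson smoother with bandwidth h."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.exp(-0.5 * ((t[:, None] - t[None, :]) / h) ** 2)
    np.fill_diagonal(w, 0.0)                 # exclude each point from its own fit
    pred = (w @ y) / w.sum(axis=1)
    return float(np.mean((y - pred) ** 2))

def select_bandwidth(t, y, candidates):
    """Return the candidate bandwidth with the smallest leave-one-out score."""
    scores = [loo_cv_score(t, y, h) for h in candidates]
    return candidates[int(np.argmin(scores))]
```

Candidates should not be far smaller than the spacing of the design points, otherwise every leave-one-out weight can underflow to zero.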

24
Methodology using Geometric Quantiles
  • We transform the covariates by using the
    transformation $x \mapsto F_P(x)$, where $F_P$ is the
    M-distribution function. The empirical
    M-distribution function is
    $\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^{n} (x - X_i)\,\|x - X_i\|^{-1}$.
  • We use the kernel regression estimates to smooth
    the transformed datasets using LSCV (Least
    Squares Cross Validation) optimal bandwidths.

25
  • We compute the fractile volume between the two
    surfaces (for the two populations).
  • We resample a pair of datasets from each
    population using kernel density estimates with
    very small bandwidths and compute the
    resampled fractile volume. In this way we
    simulate the distribution of the fractile volume
    under both populations separately.
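
The transformation and the volume computation can be sketched for two covariates as follows. The empirical M-distribution function is the average of the unit vectors pointing from the data to x. Everything else here is an assumption made for illustration: the Gaussian kernel with one common bandwidth `h`, the evaluation grid inside a disc of radius 0.9, and the function names.

```python
import numpy as np

def m_distribution(X, points, eps=1e-12):
    """Empirical M-distribution function: mean of (x - X_i) / ||x - X_i||."""
    X = np.asarray(X, dtype=float)
    points = np.asarray(points, dtype=float)
    diff = points[:, None, :] - X[None, :, :]
    norm = np.linalg.norm(diff, axis=2, keepdims=True)
    return (diff / np.maximum(norm, eps)).mean(axis=1)   # zero term if x = X_i

def nw_surface(T, y, grid, h):
    """Nadaraya-Watson surface with a Gaussian kernel and common bandwidth h."""
    d2 = ((grid[:, None, :] - T[None, :, :]) ** 2).sum(axis=2)
    w = np.exp(-0.5 * d2 / h ** 2)
    return (w @ y) / w.sum(axis=1)

def fractile_volume(X1, y1, X2, y2, h=0.2, m=25):
    """Volume between the two fractile surfaces over a grid in the unit disc."""
    T1, T2 = m_distribution(X1, X1), m_distribution(X2, X2)
    g = np.linspace(-0.9, 0.9, m)
    grid = np.array([(a, b) for a in g for b in g if a * a + b * b < 0.81])
    s1 = nw_surface(T1, np.asarray(y1, dtype=float), grid, h)
    s2 = nw_surface(T2, np.asarray(y2, dtype=float), grid, h)
    cell = (g[1] - g[0]) ** 2                             # area of one grid cell
    return float(np.abs(s1 - s2).sum() * cell)
```

The resampled fractile volumes would be obtained by applying `fractile_volume` to pairs of datasets drawn from a smoothed version of one population.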

26
Remarks
  • For choosing the kernel regression bandwidths
    for X we assume $h_1 = h_2 = \cdots = h_d$, where $h_i$ is
    the bandwidth for $X_i$ (note that
    $X = (X_1, \ldots, X_d)$). This keeps the computation
    relatively simple and feasible.
  • This method has the advantage that we do not
    need any LSCV optimal kernel density estimates, as
    was the case with PCM's Fractile.

27
Evaluation of the Methods using Synthetic Data
  • Models considered
  • Single covariate
  • $Y = a + bX + e$, where $e \sim N(0, \sigma^2)$, $X \sim N(0,1)$.
  • $Y = 1/(1 + X^{2k}) + e$, where $k = 0.5, 1.0, 1.5, 2.0, 2.5$.
  • $Y = c_s X^{2s} e^{-X^2} + e$, where $s = 1, 2, 3, 4$ and
    $c_s$ is a normalizing constant.
  • Multiple covariates (2 covariates)
  • $Y = a + b_1 X_1 + b_2 X_2 + e$, where $e \sim N(0, \sigma^2)$,
    $X_1 \sim N(0,1)$, $X_2 \sim N(0,1)$.

28
Remarks
  • The Swap Method has a natural tendency to lower
    the resampled fractile area. Thus it exhibits
    high power but also a high observed level.
  • The Swap Method essentially works on the number
    of crossings of the two datasets.
  • The Swap Method works better when the error
    variance is large compared to the variance of X.

29
  • As the difference in the variance of the error
    terms in the two models increases, the two
    P-values for the Resampling from Joint Density
    Method show a significant difference.
  • The Resampling from Joint Density Method gives
    good results in all the models we considered,
    whereas the Swap the Response Method behaves
    very unsatisfactorily in the two nonlinear
    models considered.
  • We recommend the Resampling from Joint Density
    Method.

30
Case Studies
  • State-wise data from scheduled commercial banks
    (RBI BSR data)
  • Amount Outstanding (as a fraction of Credit
    Limit) vs. Credit Limit.
  • Credit Deposit Ratio vs. Total Deposit.
  • Credit Deposit Ratio vs. Distribution of
    Employees (officers, clerks and subordinates)
    (2 covariates and 3 covariates).
  • Credit Deposit Ratio vs. accounts and offices.
  • RBI data on private corporate firms
  • Gross Company Profits vs. Sales and Paid Up
    Capital.

31
[Figure: Amount Outstanding (as a ratio of Credit Limit)
vs. Credit Limit]
32
[Figure: Credit Deposit Ratio vs. Total Deposit]
33
Grey Scale Image of the P-values for the two
methods with varying kernel regression
bandwidths (Amount Outstanding (as a ratio of
Credit Limit) vs. Credit Limit for the years 1996
and 1998).
[Figure panels: "Using 1996" and "Using 1998", each
shown for Resample from the Joint Density of (X,Y)
and for the Swap the Response Method.]
34
Grey Scale Image of the P-values for the two
methods with varying kernel regression
bandwidths (Amount Outstanding (as a ratio of
Credit Limit) vs. Credit Limit for the years 1997
and 2000).
[Figure panels: "Using 1997" and "Using 2000", each
shown for Resample from the Joint Density of (X,Y)
and for the Swap the Response Method.]
35
Remarks (for Single Covariate)
  • The fractile graph of the year 2002 is very
    different from the other years.
  • Over the years the ratio of Amount Outstanding
    to Credit Limit has decreased.
  • For the pairwise comparison tests between
    Credit Deposit Ratio and Total Deposit, most of
    the P-values are very high. The fractile graph
    for the year 2002 is somewhat different from
    the rest.

36
Remarks (for Multiple Covariates)
  • With the inclusion of the subordinates the
    regression functions change (as indicated by the
    change in P-values).
  • The number of subordinates has decreased
    steadily over the years from 1996 to 2002.
  • The P-values for the pairwise comparisons for
    Credit Deposit Ratio vs. offices and accounts
    are very high.

37
Thank you