Title: Bodhisattva Sen
2. Topics
- Fractile Graphical Analysis (FGA): definition, motivation, Prof. Mahalanobis' method
- Statistical inference using FGA: our ideas
- Two notions of multivariate quantiles: Geometric Quantiles and PCMs Fractile
- Extension of FGA to the multiple covariate setup using the above two notions of quantiles
- Evaluation of the performance of the methods using synthetic data and real-life data
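3. Want to compare the regression functions for two populations with a single covariate:
$m_1(t) = E[\,Y_1 \mid F_1(X_1) = t\,]$ and $m_2(t) = E[\,Y_2 \mid F_2(X_2) = t\,]$, $t \in (0,1)$,
where $F_1$, $F_2$ are the distribution functions of $X_1$ and $X_2$.
When $X_1$ and $X_2$ are not on comparable scales, the above transformation does the necessary standardization.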
4. Mahalanobis' Idea for Comparing Fractile Graphs (Econometrica, 1960)
- Divide the dataset into fractile groups according to the x-variable. Plot the y-averages for each fractile group and join the consecutive points (y-averages) by straight lines.
- Compute the area between the two fractile graphs for the two different samples. This is called the separation area.
- Using two sub-samples from the same population, the significance of the observed separation area is tested.
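The first two steps can be sketched as follows. This is only an illustration under assumptions that are not in the slides: the function names are hypothetical, the groups are taken to be of (nearly) equal size, and the graph vertices are placed at the group midpoints on the fractile axis.

```python
from statistics import mean

def fractile_graph(xs, ys, groups):
    """Y-average within each fractile group, groups ordered by x.
    Assumes len(xs) >= groups so that no group is empty."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    n = len(order)
    averages = []
    for g in range(groups):
        lo, hi = g * n // groups, (g + 1) * n // groups
        averages.append(mean(ys[i] for i in order[lo:hi]))
    return averages

def separation_area(graph_a, graph_b):
    """Area between two piecewise-linear graphs with the same number of
    groups G; vertices are assumed to sit at (g + 0.5) / G, so adjacent
    vertices are 1 / G apart."""
    width = 1.0 / len(graph_a)
    diffs = [a - b for a, b in zip(graph_a, graph_b)]
    area = 0.0
    for d0, d1 in zip(diffs, diffs[1:]):
        if d0 * d1 >= 0:
            # The graphs do not cross on this segment: a trapezoid.
            area += width * (abs(d0) + abs(d1)) / 2
        else:
            # The graphs cross once: two triangles meeting at the crossing.
            area += width * (d0 * d0 + d1 * d1) / (2 * (abs(d0) + abs(d1)))
    return area
```

For example, `separation_area(fractile_graph(x1, y1, 10), fractile_graph(x2, y2, 10))` would compare two samples using 10 fractile groups each.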
5. Limitations of Mahalanobis' Idea
- The exact distribution of the test statistic (the separation area) is not known. Only some approximations were tried by Mahalanobis and his co-workers.
- He divided the original dataset into two parts and used the area between the two sub-samples (called the error area) to test the significance of the observed separation area. The error area calculated in this way has a different variance than the separation area due to the decrease in the number of sample points.
- The accuracy of the approximation was poor, and this was known to Mahalanobis.
6. Statistical Inference in FGA: Our Ideas
- We discuss two methods to test the equality of the regression functions (single covariate):
- Swap the Response (Method I)
- Resampling from the Joint Density of (X,Y) (Method II).
7. Swap the Response
- Transform the covariates into the corresponding quantiles, i.e., the ith ranked x-value is transformed to i/(n+1), where n is the number of data points.
- Draw the fractile graphs by smoothing the y-values using the usual kernel regression estimates and compute the separation area between them.
- To test the significance of the observed separation area we resample from the two samples in the following way:
8. We form two resampled datasets and compute the separation area between them.
Case (I): Suppose n1 = n2 = n (i.e., both samples are of the same size). For the ith data point (i.e., with quantile value i/(n+1)) we interchange or swap the y-values of the two datasets with probability 0.5 and keep the original y-values with probability 0.5.
Case (II): The sample sizes are different. We interpolate the dataset from the 2nd population to find the y-value at the i/(n1+1)th quantile and then repeat the same idea as above. Similarly we get the other resampled dataset. (Modified Swap Method)
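Case (I) can be sketched roughly as below. Several choices here are assumptions rather than details given in the slides: a Gaussian kernel with a fixed bandwidth `h`, a Riemann-sum approximation of the area on a grid of `m` points, and a P-value computed as the proportion of resampled areas that reach the observed one. The inputs `y1` and `y2` are the y-values of each sample already ordered by x-rank, with equal lengths.

```python
import math
import random

def nw_smooth(t, ys, h, grid):
    """Nadaraya-Watson estimate with a Gaussian kernel at each grid point."""
    out = []
    for g in grid:
        w = [math.exp(-0.5 * ((g - ti) / h) ** 2) for ti in t]
        out.append(sum(wi * yi for wi, yi in zip(w, ys)) / sum(w))
    return out

def area_between(y1, y2, h=0.1, m=50):
    """Approximate area between the two smoothed fractile graphs."""
    grid = [(j + 0.5) / m for j in range(m)]
    curves = []
    for ys in (y1, y2):
        n = len(ys)
        t = [(i + 1) / (n + 1) for i in range(n)]  # ith rank -> i/(n+1)
        curves.append(nw_smooth(t, ys, h, grid))
    return sum(abs(a - b) for a, b in zip(*curves)) / m

def swap_pvalue(y1, y2, reps=200, h=0.1, seed=0):
    """Observed separation area and a resampling P-value (equal sample sizes)."""
    rng = random.Random(seed)
    observed = area_between(y1, y2, h)
    exceed = 0
    for _ in range(reps):
        r1, r2 = [], []
        for a, b in zip(y1, y2):
            if rng.random() < 0.5:  # swap the two responses at this quantile
                a, b = b, a
            r1.append(a)
            r2.append(b)
        if area_between(r1, r2, h) >= observed:
            exceed += 1
    return observed, exceed / reps
```

The unequal-size case (II) would additionally need an interpolation step before the swap, which is not shown here.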
9. Resample from the Joint Density of (X,Y)
- We use the usual kernel density estimates with a very small bandwidth (bandwidth proportional to the inverse of the sample size) to generate a pair of resampled datasets from each population. It is like resampling from a smoothed version of the empirical distribution of (X,Y).
- We compute the separation area for each pair of resampled datasets from the same population. This gives us the distribution of the separation area under $H_0$. (Note that we have 2 separation areas corresponding to the 2 populations.)
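A minimal sketch of this resampling step, under stated assumptions: sampling from a Gaussian-kernel density estimate is done by picking a data point at random and adding Gaussian noise with standard deviation equal to the bandwidth; the same bandwidth `c / n` is used for both coordinates; and `crude_area` is a hypothetical stand-in for the separation area, included only so that the example runs on its own.

```python
import random

def smoothed_resample(pairs, rng, c=1.0):
    """Draw len(pairs) points from a Gaussian-kernel density estimate of (X, Y)."""
    n = len(pairs)
    h = c / n  # bandwidth proportional to the inverse of the sample size
    sample = []
    for _ in range(n):
        x, y = rng.choice(pairs)
        sample.append((x + rng.gauss(0.0, h), y + rng.gauss(0.0, h)))
    return sample

def crude_area(sample_a, sample_b, groups=4):
    """Stand-in area: mean absolute gap between fractile-group y-averages."""
    def averages(sample):
        ys = [y for _, y in sorted(sample)]
        n = len(ys)
        cuts = [g * n // groups for g in range(groups + 1)]
        return [sum(ys[lo:hi]) / (hi - lo) for lo, hi in zip(cuts, cuts[1:])]
    gaps = zip(averages(sample_a), averages(sample_b))
    return sum(abs(a - b) for a, b in gaps) / groups

def null_distribution(pairs, area_fn, reps=200, seed=0):
    """Areas between pairs of datasets resampled from the same population."""
    rng = random.Random(seed)
    return [area_fn(smoothed_resample(pairs, rng), smoothed_resample(pairs, rng))
            for _ in range(reps)]

def p_value(observed, null_areas):
    """Proportion of null areas at least as large as the observed area."""
    return sum(a >= observed for a in null_areas) / len(null_areas)
```

Running `null_distribution` once per population gives the two reference distributions, and hence two P-values for one observed separation area.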
10. Plots of the datasets along with the Fractile Graphs. The models are $Y = 1.0 + X + \varepsilon$ and $Y = 1.2 + X + \varepsilon$, where $\varepsilon \sim N(0, 0.09)$, $X \sim N(0,1)$, and the number of data points is N = 100.
11. Swap the Response Method: an illustration
12. Resample from Joint Distribution of (X,Y): an illustration
13. How do we define Quantiles in the Multivariate setup?
We discuss two such notions of Multivariate Quantiles:
- 1. Geometric Quantiles and 2. PCMs Fractile
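14. Geometric Quantiles
Chaudhuri (JASA 1996), Koltchinskii (Annals of Stat. 1997)
- If $X \sim P$ ($X \in R^d$) then we define the Geometric Quantile $Q_P(u)$ for $u \in B^{(d)} = \{u : \|u\| < 1\}$ as
$Q_P(u) = \arg\min_{Q \in R^d} E\{\Phi(u, X - Q) - \Phi(u, X)\}$,
where $\Phi(u, t) = \|t\| + \langle u, t \rangle$,
and the norm $\|\cdot\|$ is the usual Euclidean norm and $\langle \cdot, \cdot \rangle$ denotes the usual inner product.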
15. Properties of Geometric Quantiles
- The solution $Q_P(u)$ always exists for any u and it is unique if P is not supported on a straight line.
- $Q_P(u)$ characterizes the associated distribution, i.e., $Q_{P_1}(u) = Q_{P_2}(u)$ for all u implies $P_1 = P_2$.
- Computation of the sample geometric quantile for the data set $X_1, X_2, \dots, X_n$, via
$\hat{Q}_n(u) = \arg\min_{Q \in R^d} \sum_{i=1}^{n} \{\|X_i - Q\| + \langle u, X_i - Q \rangle\}$,
is straightforward.
- The Geometric Quantiles have an asymptotic multivariate normal distribution and the convergence rate is $n^{1/2}$.
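One possible way to compute a sample geometric quantile is sketched below. It minimizes the sum over i of ||X_i - Q|| + <u, X_i - Q> with a Weiszfeld-type fixed-point iteration; that algorithm is a choice made here for illustration and is not named in the slides. The function names, the starting point (the sample mean) and the tolerances are assumptions.

```python
import math
import random

def geometric_quantile(points, u, iters=1000, tol=1e-10):
    """Minimize sum_i ||X_i - Q|| + <u, X_i - Q> over Q, for ||u|| < 1."""
    n, d = len(points), len(u)
    q = [sum(p[k] for p in points) / n for k in range(d)]  # start at the mean
    for _ in range(iters):
        # Weights 1 / ||X_i - Q||, guarded in case Q lands on a data point.
        w = [1.0 / max(math.dist(p, q), 1e-12) for p in points]
        sw = sum(w)
        new = [(sum(wi * p[k] for wi, p in zip(w, points)) + n * u[k]) / sw
               for k in range(d)]
        step = math.dist(new, q)
        q = new
        if step < tol:
            break
    return q

def estimating_equation(points, q, u):
    """Norm of u + (1/n) sum_i (X_i - q) / ||X_i - q||; near 0 at the quantile."""
    n, d = len(points), len(u)
    g = list(u)
    for p in points:
        r = math.dist(p, q)
        if r > 0:
            for k in range(d):
                g[k] += (p[k] - q[k]) / (r * n)
    return math.hypot(*g)
```

With `u = (0, ..., 0)` this reduces to the spatial median; moving `u` toward the unit sphere pushes the quantile outward in the direction of `u`.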
16. Drawbacks of Geometric Quantile
- No simple distributional interpretation as was the case with the univariate quantiles.
- Though the Geometric Quantile is equivariant with respect to shifts, orthogonal transformations and homogeneous scale transformations, it is not equivariant under heterogeneous scale transformations.
- Affine equivariant versions of Geometric Quantiles are defined using Transformation-Retransformation methods (see On Affine Equivariant Multivariate Quantiles, Biman Chakraborty, Ann. Inst. Stat. Math, Vol. 53, No. 2, 380-403 (2001)). However, we do not intend to consider them here.
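17. Another notion of Multivariate Quantile: PCMs Fractile - Mahalanobis (1970)
For a d-dimensional random vector $X = (X_1, X_2, \dots, X_d) \sim P$, we define the PCMs distribution function $H_P$ from $R^d$ to $[0,1]^d$ as
$H_P(x) = (F_1(x_1), F_{2|1}(x_2 \mid x_1), \dots, F_{d|1,\dots,d-1}(x_d \mid x_1, \dots, x_{d-1}))$,
where $F_1$ is the distribution function of $X_1$ and $F_{i|1,\dots,i-1}$ is the conditional distribution function of $X_i$ given $X_1, \dots, X_{i-1}$.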
18. Why should we consider PCMs Fractile?
- If P and Q are two distributions with continuous densities on $R^d$ and $H_P(x) = H_Q(x)$ for all x in $R^d$, then P = Q.
- Define $Z = (Z_1, Z_2, \dots, Z_d) \sim Q$ such that $Z_i = f_i(X_i)$ for all i = 1, ..., d, where each $f_i$ is a strictly monotone (increasing) function on R. Then $H_P(x) = H_Q(f(x))$, where $f(x) = (f_1(x_1), f_2(x_2), \dots, f_d(x_d))$. This shows that the PCMs fractile is equivariant under coordinate-wise monotonic transformations.
- The PCMs fractile has simple probabilistic interpretations.
19. Drawbacks of PCMs Fractile
- Requires the computation of the conditional distributions, which are estimated by kernel density estimation. This estimation is very unstable in high dimensions due to the lack of data points. It is also computationally very intensive and time consuming.
- The PCMs Fractile map is not $n^{1/2}$-consistent as it uses density estimates which are $n^{v}$-consistent ($v < 1/2$). v may be much smaller than ½ when the covariate dimension is high (Curse of Dimensionality in Nonparametric Function Estimation).
- PCMs Fractile depends on the ordering of the co-ordinate random variables: we have to fix an order and then work with it.
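To make the role of the conditional distributions and of the coordinate ordering concrete, here is a rough sketch for d = 2. It assumes the PCMs distribution function stacks the marginal distribution function of the first coordinate and the conditional distribution function of the second coordinate given the first. The conditional part is estimated with a kernel-weighted empirical distribution function, a simplification chosen here rather than the kernel density route described in the slides; the function name and the bandwidth `h` are hypothetical.

```python
import math

def pcm_transform_2d(data, x, h=0.3):
    """Estimate (F1(x1), F2|1(x2 | x1)) from a list of (x1, x2) pairs.

    Assumes x[0] lies close enough to the observed first coordinates that
    the kernel weights do not all underflow to zero.
    """
    n = len(data)
    # Marginal empirical distribution function of the first coordinate.
    f1 = sum(a <= x[0] for a, _ in data) / n
    # Gaussian kernel weights localize the sample around x1.
    w = [math.exp(-0.5 * ((x[0] - a) / h) ** 2) for a, _ in data]
    # Weighted proportion of second coordinates not exceeding x2.
    f21 = sum(wi for wi, (_, b) in zip(w, data) if b <= x[1]) / sum(w)
    return f1, f21
```

Swapping the two coordinates of `data` and `x` before the call corresponds to the other ordering, and in general gives a different transformed point.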
20. Extension of Fractile Graphical Analysis in the Multiple Covariate Case
- We use two different notions of Multivariate Quantiles (Geometric Quantiles and PCMs Fractile).
- We extend the method of Resampling from the joint density of (X,Y) using both the above notions of Multivariate Quantiles.
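21. Methodology using PCMs Fractile
1. Transform the datasets to
$\{(H_1(X_{1i}), Y_{1i}) : i = 1, \dots, n_1\}$ and $\{(H_2(X_{2j}), Y_{2j}) : j = 1, \dots, n_2\}$,
where $H_k$ is the PCMs distribution function of the covariates in population k.
2. Smooth the two transformed datasets by using the usual multivariate kernel regression estimates to get the two fractile surfaces.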
22. (Methodology using PCMs Fractile, continued)
3. The difference between the two fractile graphs gives the separation volume between the two populations.
4. To find the significance level for the observed volume we resample a pair of datasets from the same population using a very small bandwidth (from a fitted joint density of X and Y) and recalculate the volume between the two graphs for the same population. In this way we try to estimate the distribution of the separation volume under $H_0$.
23. Remarks
- We used Least Squares Cross Validation (LSCV) methods in choosing the bandwidth parameters in kernel regression and kernel density estimates (note that we use LSCV optimal bandwidths for KDE for the transformation $x \mapsto H(x)$).
- The method described depends highly on the fitted joint distribution of the covariate (i.e., X) as we are computing the Quantiles from this kernel density estimate.
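For the kernel regression part, cross-validated bandwidth choice could look like the sketch below. It covers only regression with a single covariate and a Gaussian kernel, and it searches over a user-supplied list of candidate bandwidths; the function names and the grid search are assumptions made for illustration, not the authors' implementation.

```python
import math

def loo_cv_score(xs, ys, h):
    """Mean squared leave-one-out prediction error of a Nadaraya-Watson
    smoother with a Gaussian kernel and bandwidth h.

    Assumes h is not so small that every other point gets zero weight.
    """
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        num = den = 0.0
        for j, (xj, yj) in enumerate(zip(xs, ys)):
            if j != i:  # leave the ith observation out
                w = math.exp(-0.5 * ((xi - xj) / h) ** 2)
                num += w * yj
                den += w
        total += (yi - num / den) ** 2
    return total / len(xs)

def select_bandwidth(xs, ys, candidates):
    """Candidate bandwidth with the smallest leave-one-out score."""
    return min(candidates, key=lambda h: loo_cv_score(xs, ys, h))
```

A call such as `select_bandwidth(t, y, [0.02, 0.05, 0.1, 0.2])` would pick one bandwidth for a fractile graph with fractile positions `t`.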
24. Methodology using Geometric Quantiles
- We transform the covariates by using the transformation $x \mapsto F_P(x)$, where $F_P$ is the M-distribution function. The empirical M-distribution function is $\frac{1}{n} \sum_{i} (x - X_i)\,\|x - X_i\|^{-1}$.
- We use the kernel regression estimates to smooth the transformed datasets using LSCV (Least Squares Cross Validation) optimal bandwidths.
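The empirical M-distribution function is an average of unit vectors pointing from the data points to x, which can be sketched as follows. The function name is hypothetical, and data points that coincide with x are skipped, a convention assumed here to avoid dividing by zero.

```python
import math

def m_distribution(points, x):
    """Empirical M-distribution function: (1/n) * sum_i (x - X_i) / ||x - X_i||."""
    n, d = len(points), len(x)
    total = [0.0] * d
    for p in points:
        r = math.dist(x, p)
        if r > 0:  # skip data points equal to x
            for k in range(d):
                total[k] += (x[k] - p[k]) / r
    return [t / n for t in total]
```

In one dimension, for x not equal to a data point, this reduces to 2 F_n(x) - 1 with F_n the empirical distribution function; in d dimensions the result always lies in the closed unit ball.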
25. (Methodology using Geometric Quantiles, continued)
- We compute the fractile volume between the two surfaces (for the two populations).
- We resample a pair of datasets from each population using kernel density estimates with very small bandwidths and compute the resampled fractile volume. In this way we simulate the distribution of the fractile volume under both the populations separately.
26. Remarks
- For choosing the kernel regression bandwidths for X we assume $h_1 = h_2 = \dots = h_d$, where $h_i$ is the bandwidth for $X_i$ (note that $X = (X_1, \dots, X_d)$). This keeps the computation relatively simple and feasible.
- This method has the advantage that we do not need any LSCV optimal kernel density estimates, as was the case with PCMs Fractile.
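With a common bandwidth, a multivariate kernel regression estimate needs only one tuning parameter. A minimal sketch, assuming a Gaussian product kernel (the slides do not say which kernel was used) and a hypothetical function name:

```python
import math

def nw_common_bandwidth(xs, ys, x0, h):
    """Nadaraya-Watson estimate at x0 with a Gaussian product kernel and
    the same bandwidth h in every coordinate (h_1 = ... = h_d = h)."""
    weights = [
        math.exp(-0.5 * sum((a - b) ** 2 for a, b in zip(x, x0)) / h ** 2)
        for x in xs
    ]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)
```

Here `xs` would hold the transformed covariate vectors and `x0` a point of the grid on which the fractile surface is evaluated.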
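27. Evaluation of the Methods using Synthetic Data
- Models considered
- Single covariate
  - $Y = a + bX + \varepsilon$, where $\varepsilon \sim N(0, \sigma^2)$, $X \sim N(0,1)$.
  - $Y = 1/(1 + X^{2k}) + \varepsilon$, where k = 0.5, 1.0, 1.5, 2.0, 2.5.
  - $Y = c_s X^{2s} e^{-X^2} + \varepsilon$, where s = 1, 2, 3, 4 and $c_s$ is a normalizing constant.
- Multiple covariates (2 covariates)
  - $Y = a + b_1 X_1 + b_2 X_2 + \varepsilon$, where $\varepsilon \sim N(0, \sigma^2)$, $X_1 \sim N(0,1)$, $X_2 \sim N(0,1)$.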
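Synthetic data from the two linear models could be generated as in the sketch below. The function names and the use of a seeded `random.Random` generator are assumptions; the nonlinear models are left out.

```python
import random

def simulate_linear(n, a, b, sigma, rng):
    """n pairs (x, y) from Y = a + b X + eps, X ~ N(0,1), eps ~ N(0, sigma^2)."""
    data = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        data.append((x, a + b * x + rng.gauss(0.0, sigma)))
    return data

def simulate_two_covariates(n, a, b1, b2, sigma, rng):
    """n pairs ((x1, x2), y) from Y = a + b1 X1 + b2 X2 + eps, with X1 and X2
    drawn independently from N(0,1) and eps ~ N(0, sigma^2)."""
    data = []
    for _ in range(n):
        x1, x2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        data.append(((x1, x2), a + b1 * x1 + b2 * x2 + rng.gauss(0.0, sigma)))
    return data
```

Drawing both samples from the same model and recording how often a test rejects gives an estimate of its observed level; drawing them from different models gives an estimate of its power.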
28. Remarks
- The Swap Method has a natural tendency to lower the resampled fractile area. Thus it exhibits high power but also a high observed level.
- The Swap Method essentially works on the number of crossings of the two datasets.
- The Swap Method works better when the error variance is large compared to the variance of X.
29. (Remarks, continued)
- As the difference in the variances of the error terms in the two models increases, the two P-values for the Resampling from Joint Density Method show a significant difference.
- The Resampling from Joint Density Method gives good results in all the models considered by us, whereas the Swap the Response Method behaves very unsatisfactorily in the two nonlinear models considered.
- We recommend the Resampling from Joint Density method.
30. Case Studies
- State-wise data from scheduled commercial banks - RBI BSR data
  - Amount Outstanding (as a fraction of Credit Limit) VS Credit Limit.
  - Credit Deposit Ratio VS Total Deposit.
  - Credit Deposit Ratio VS Distribution of Employees (officers, clerks and subordinates) (2 covariates and 3 covariates).
  - Credit Deposit Ratio VS accounts and offices.
- RBI data on private corporate firms
  - Gross Company Profits VS Sales and Paid Up Capital.
31. Amount Outstanding (as a ratio of Credit Limit) VS Credit Limit
32. Credit Deposit Ratio VS Total Deposit
33. Grey Scale Image of the P-values for the two methods with varying kernel regression bandwidths (Amount Outstanding (as ratio of Credit Limit) VS Credit Limit for the years 1996 and 1998).
[Figure panels: Using 1996; Using 1998. Methods: Resample from the Joint Density of (X,Y); Swap the Response Method.]
34. Grey Scale Image of the P-values for the two methods with varying kernel regression bandwidths (Amount Outstanding (as ratio of Credit Limit) VS Credit Limit for the years 1997 and 2000).
[Figure panels: Using 1997; Using 2000. Methods: Resample from the Joint Density of (X,Y); Swap the Response Method.]
35. Remarks (for Single Covariate)
- The fractile graph of the year 2002 is very different from the other years.
- Over the years the ratio of Amount Outstanding to Credit Limit has decreased.
- For the pairwise comparison tests between Credit Deposit Ratio and Total Deposit, most of the P-values are very high. The fractile graph for the year 2002 is somewhat different from the rest.
36. Remarks (for Multiple Covariates)
- With the inclusion of the subordinates the regression functions change (as indicated by the change in P-values).
- The subordinates have decreased steadily over the years from 1996 to 2002.
- The P-values for the pairwise comparisons for Credit Deposit Ratio VS offices and accounts are very high.
37. Thank you