Title: Univariate and Multivariate Analysis
1Univariate and Multivariate Analysis
Suresh Rathi, Program Consultant, The INCLEN Trust International,
New Delhi 110020, suresh_at_inclentrust.org
2- We owe a lot to the Indians, who taught us how
to count, without which no worthwhile scientific
discovery could have been made
Albert Einstein
3STATISTICS
- Defined as the
- Collection
- Compilation
- Presentation
- Analysis
- Interpretation
- OF DATA
- When applied to the medical sciences: Biostatistics
4 5Data Analyses
- Descriptive Statistics
- Frequency Distributions and Cross -Tabulations
- Measures of central tendency and dispersion
- Univariate / Bivariate Analysis
- t-tests and Analysis of Variance (ANOVA)
- Chi-square test
- Multivariate Analysis
- To adjust for simultaneous effects of multiple
factors or to control the effects of confounding
factors on the outcome variable.
6Descriptive analysis
- In the first step, descriptive analysis will be done:
- Summarizing demographic variables by computing means with standard deviations for continuous variables, and
- Percentages for categorical variables.
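As a minimal sketch of this descriptive step, the snippet below computes a mean with standard deviation for a continuous variable and percentages for a categorical one. The records and the helper names (`summarize_continuous`, `summarize_categorical`) are hypothetical and invented for illustration; they do not come from the slides.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical records, invented for illustration only.
ages = [34, 45, 29, 52, 41, 38]          # continuous variable (years)
sexes = ["F", "M", "F", "F", "M", "F"]   # categorical variable


def summarize_continuous(values):
    """Return (mean, sample standard deviation) for a continuous variable."""
    return mean(values), stdev(values)


def summarize_categorical(values):
    """Return the percentage of records falling in each category."""
    counts = Counter(values)
    total = len(values)
    return {category: 100 * n / total for category, n in counts.items()}


age_mean, age_sd = summarize_continuous(ages)
sex_pct = summarize_categorical(sexes)
print(f"Age: mean {age_mean:.1f}, SD {age_sd:.1f}")
for category, pct in sex_pct.items():
    print(f"Sex {category}: {pct:.1f}%")
```

`stdev` is the sample standard deviation (divisor n − 1), which is the usual choice when the data are a sample rather than the whole population.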
7Univariate analysis
- t-tests and Analysis of Variance (ANOVA)
- Chi-square test
- Univariate logistic regression analysis will be conducted by relating each variable of interest, one at a time, to the outcome, using odds ratios (ORs) and their 95% confidence intervals (CIs).
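For a single binary factor, the odds ratio from a univariate logistic regression equals the cross-product ratio of the 2x2 table, so the calculation can be sketched without a regression library. The sketch below assumes Woolf's log-based method for the confidence interval, which the slides do not name; the function name `odds_ratio_ci` and the counts are hypothetical.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Woolf (log-based) CI for a 2x2 table.

    a = exposed with outcome,   b = exposed without outcome
    c = unexposed with outcome, d = unexposed without outcome
    z = 1.96 gives an approximate 95% interval.
    """
    or_ = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_) - z * se_log_or)
    upper = math.exp(math.log(or_) + z * se_log_or)
    return or_, lower, upper


# Hypothetical counts, invented for illustration only.
or_, lower, upper = odds_ratio_ci(a=30, b=70, c=15, d=85)
print(f"OR = {or_:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

An interval that excludes 1 corresponds to an association that is significant at roughly the 5% level.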
8Epidemiology
Observation, measurement, analysis, correlation,
interpretation
- ...the study of the Distribution and Determinants of diseases
How many? In whom? Where? When?
What? How? Why?
9Definitions
- HYPOTHESIS
- A statement of belief used in the evaluation of population values
- NULL HYPOTHESIS (H0)
- A claim that there is no difference between the population mean (μ) and the hypothesized value (μ0)
- ALTERNATE HYPOTHESIS (H1)
- A claim that disagrees with the Null Hypothesis
- TEST STATISTIC
- A statistic used to determine the relative position of the mean in the hypothesized probability distribution of sample means.
10Definitions
- CRITICAL REGION (REJECTION REGION)
- Region at the far end of the distribution
- If only one end of the distribution is involved, the test is referred to as a one-tailed test.
- If both ends are involved, the test is known as a two-tailed test.
- When the computed value falls in the critical region, we reject the Null Hypothesis.
- The probability that a test statistic falls in the critical region when the Null Hypothesis is true is denoted by α
- SIGNIFICANCE LEVEL
- Level that corresponds to the area in the critical region.
- When a test statistic falls in this area, the result is said to be significant at the α level
11Definitions
- P-VALUE
- Area in the tail(s) of a distribution beyond the value of the test statistic.
- The probability that the calculated value of the test statistic, or a more extreme one, occurred by chance alone is denoted by p
- NON-REJECTION REGION
- Region of the sampling distribution not included in α; it is located under the middle portion of the curve.
- The Non-Rejection Region is denoted by (1 − α)
- TEST OF SIGNIFICANCE (Hypothesis Test)
- Procedure used to establish the validity of a claim by determining whether or not the test statistic falls in the critical region. If it does, the results are referred to as significant.
12PROCEDURE FOR TEST OF SIGNIFICANCE (STEPS)
- I. State the Null versus the Alternate Hypothesis
- H0: μ = μ0
- H1: μ ≠ μ0 (two-tailed)
- H1: μ > μ0 or H1: μ < μ0 (one-tailed)
- II. Choose a significance level
- α = α0 (α0 = 0.05 or 0.01)
- III. Compute the test statistic (Z-test, t-test)
$$ Z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} $$
$$ t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} $$
13PROCEDURE FOR TEST OF SIGNIFICANCE (STEPS)
- IV. Determine the critical region
- This is the region of the Z-distribution or t-distribution with α/2 in each tail.
- V. Reject the Null Hypothesis if the test statistic falls in the rejection region
- Do not reject the Null Hypothesis if it falls in the non-rejection region
- VI. State the appropriate conclusion
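The six steps can be sketched for the Z-test case, where the population standard deviation is taken as known. This is an illustrative sketch, not a definitive implementation: the function name `z_test` and all numbers are hypothetical, and the one-tailed branch assumes the alternative H1: μ > μ0.

```python
import math
from statistics import NormalDist


def z_test(sample_mean, mu0, sigma, n, alpha=0.05, two_tailed=True):
    """One-sample Z-test following steps I to VI.

    Returns (z, critical value, reject H0?). For the one-tailed case this
    sketch assumes the alternative H1: mu > mu0.
    """
    # Step III: compute the test statistic.
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    # Step IV: critical region, with alpha/2 in each tail if two-tailed.
    tail_area = alpha / 2 if two_tailed else alpha
    z_crit = NormalDist().inv_cdf(1 - tail_area)
    # Step V: reject H0 only if the statistic lies in the critical region.
    reject = abs(z) > z_crit if two_tailed else z > z_crit
    return z, z_crit, reject


# Hypothetical numbers, invented for illustration only.
z, z_crit, reject = z_test(sample_mean=52.0, mu0=50.0, sigma=6.0, n=36)
print(f"z = {z:.2f}, critical value = {z_crit:.2f}, reject H0: {reject}")
```

The t-test version has the same structure, with s in place of σ and the critical value read from the t-distribution with n − 1 degrees of freedom.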
14t- Distribution
- Unimodal
- Bell Shaped
- Symmetrical
- Extends infinitely in either direction
- The area under the curve is equal to 1.0 (100%)
- Areas under the curve (α) are a function of a quantity called degrees of freedom (df)
- df = n − 1
- df measures the quantity of information available in one's data that can be used in estimating the population variance (σ²).
- Uses
- When the population SD is not known
- Sample size less than 25
15EXAMPLE
- A smog alert is issued when the amount of a particular pollutant in the air is found to be
greater than 7 ppm. Samples collected from 16 stations gave a mean (x̄) of 7.84 ppm with a
standard deviation (s) of 2.01 ppm. Do these findings indicate that the smog alert
criterion has been exceeded, or can the results be explained by chance?
16- 1. H0: μ = 7.0; H1: μ > 7.0
- 2. α = 0.05
- 3. Test statistic
$$ t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{7.84 - 7}{2.01 / \sqrt{16}} \approx 1.67 $$
- 4. Critical region
- Since H1: μ > 7.0 indicates a one-tailed test, we place all of α = 0.05 on the positive (+ve) side.
- From the table of the t-distribution we find that
- df = 15
- t = 1.753
17- 5. Since the calculated t ≈ 1.67 does not fall in the
critical region (t > 1.753), we do not reject H0.
In other words, we conclude that the data were
insufficient to indicate that the critical air
pollution level of 7 ppm has been exceeded.
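The arithmetic of the smog example can be checked with a few lines of Python. All inputs are taken from the slides, including the tabulated critical value 1.753; only the variable names are invented here.

```python
import math

# Values taken from the smog example in these slides.
n = 16               # number of stations
sample_mean = 7.84   # ppm
s = 2.01             # sample standard deviation
mu0 = 7.0            # alert threshold under H0
t_critical = 1.753   # one-tailed, alpha = 0.05, df = 15 (from the t table)

# Step 3: test statistic t = (x_bar - mu0) / (s / sqrt(n)).
t_stat = (sample_mean - mu0) / (s / math.sqrt(n))
df = n - 1

# Step 5: one-tailed decision on the positive side.
reject_h0 = t_stat > t_critical

print(f"t = {t_stat:.2f}, df = {df}, reject H0: {reject_h0}")
```

The statistic falls short of the critical value, so the decision matches the conclusion in step 5: H0 is not rejected.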
18Chi-Square Test (χ²)
- For Qualitative Data
- Smoker or Non-Smoker
- Normotensive or Hypertensive
$$ \chi^2 = \sum \frac{(O - E)^2}{E} $$
- O = observed count, E = expected count
- df = degrees of freedom
- df = (c − 1)(r − 1) = (Columns − 1)(Rows − 1)
19For Example
- In a study we find that 76 out of 100 children
treated with Vit C and 63 out of 100 children in the placebo group
caught a cold. Does the development of colds differ
between the two groups?
20- 1. H0: The two groups are homogeneous in their cold-developing pattern.
- H1: The two groups are not homogeneous in their cold-developing pattern.
- 2. α = 0.05
- 3. Critical region?
- χ² with df = (c − 1)(r − 1)
21- 4. Test statistic
$$ \chi^2 = \sum \frac{(O - E)^2}{E} $$
$$ \text{Expected value} = \frac{\text{Row total} \times \text{Column total}}{\text{Grand total}} $$
22-
   O      E       O−E     (O−E)²    (O−E)²/E
  --------------------------------------------
   76    69.5    +6.5     42.25      0.608
   63    69.5    −6.5     42.25      0.608
   24    30.5    −6.5     42.25      1.385
   37    30.5    +6.5     42.25      1.385
  --------------------------------------------
   χ² = Σ (O−E)²/E = 3.986
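The chi-square calculation for this 2x2 example can be reproduced with the short sketch below. The observed counts come from the slides; the column labels and the closed-form p-value (valid for df = 1 only) are additions made here for illustration.

```python
import math

# Observed counts from the example: rows are Vit C and placebo,
# columns are "caught cold" and "did not catch cold".
observed = [[76, 24],
            [63, 37]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count = row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (O - E)^2 / E over all four cells.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(col_totals) - 1) * (len(row_totals) - 1)

# For df = 1 only, the upper-tail p-value has a closed form via erfc.
p_value = math.erfc(math.sqrt(chi_square / 2))

print(f"chi-square = {chi_square:.3f}, df = {df}, p = {p_value:.3f}")
```

With df = 1 the tabulated critical value at α = 0.05 is 3.841. The computed statistic exceeds it, so on these figures the hypothesis of homogeneity would be rejected at the 5% level.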
23Multivariate analysis
- Multiple models
- Linear regression
- Logistic regression
- Cox model
- Poisson regression
- Loglinear model
- Discriminant analysis
- Choice of the tool according to the objectives,
the study, and the variables
24Multiple Regression
25Multiple Regression
Regression Analysis is the
estimation of the linear relationship between a
dependent variable and one or more independent
variables or covariates.
26Multiple Regression
- Linear
- Logistic
- Independent variables
- Dependent variable
27Simple linear regression
- Relation between 2 continuous variables (SBP and age)
- Regression coefficient b1
- Measures association between y and x
- Amount by which y changes on average when x changes by one unit
- Least squares method
[Figure: fitted regression line of y against x, with the slope marked]
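A minimal sketch of the least squares method for one predictor follows. The intercept b0, the function name `least_squares` and the age and SBP values are assumptions introduced for illustration; the slides name only the slope b1.

```python
from statistics import mean


def least_squares(x, y):
    """Return (b0, b1) minimizing the squared residuals of y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx            # slope: average change in y per unit of x
    b0 = y_bar - b1 * x_bar   # intercept: fitted line passes through the means
    return b0, b1


# Hypothetical ages (years) and SBP readings (mmHg), invented for illustration.
age = [30, 40, 50, 60, 70]
sbp = [118, 124, 131, 135, 142]

b0, b1 = least_squares(age, sbp)
print(f"SBP = {b0:.1f} + {b1:.2f} * age")
```

Here b1 is read as the average change in SBP for each additional year of age.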
28Multiple linear regression
- Relation between a continuous variable and a set of i continuous variables
- Partial regression coefficients bi
- Amount by which y changes on average when xi changes by one unit and all the other xi's remain constant
- Measures association between xi and y adjusted for all other xi
- Example
- SBP versus age, weight, height, etc.
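Written out, the model behind these partial regression coefficients might look as follows. The intercept b_0 and the error term e are standard notation assumed here, not symbols taken from the slides; the second line applies the same form to the SBP example.

```latex
y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_i x_i + e

\text{SBP} = b_0 + b_1\,\text{age} + b_2\,\text{weight} + b_3\,\text{height} + e
```

In the second line, b_1 is the average change in SBP per additional year of age with weight and height held constant.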
29Multiple linear regression
- Predicted            Predictor variables
- Response variable    Explanatory variables
- Outcome variable     Covariables
- Dependent variable   Independent variables
30- Multiple Logistic Regression
31Multivariate analysis
- Before conducting multivariate analysis, association among independent variables will be checked by chi-square test. All the variables meeting the selection criteria will be entered one by one, starting with the most significant factor from the univariate analysis.
- Selection of the final model will be based on
- Parsimony (good sense),
- Biological interpretability, and
- Statistical significance.
- The adjusted odds ratios (ORs) and their 95% confidence intervals (CIs) will be computed using the estimates of the parameters of the final model. The dependent variable will be dichotomous.
- P-values will be noted to assess the model fit.
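In a logistic model, the adjusted OR for a covariate is the exponential of its estimated coefficient, and an approximate 95% CI comes from exponentiating the coefficient plus or minus 1.96 standard errors. A minimal sketch is given below; the function name `adjusted_or` and the coefficient and standard error are hypothetical and do not come from any fitted model in the slides.

```python
import math


def adjusted_or(beta, se, z=1.96):
    """Adjusted odds ratio and approximate 95% CI from a logistic coefficient.

    beta = estimated coefficient for one covariate in the final model
    se   = its standard error
    """
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


# Hypothetical estimate, invented for illustration only.
or_adj, lower, upper = adjusted_or(beta=0.693, se=0.25)
print(f"Adjusted OR = {or_adj:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

A coefficient of 0 gives an OR of 1, meaning no association after adjustment for the other variables in the model.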
32THANKS