Title: Statistical Techniques for Analyzing Quantitative Data
1Statistical Techniques for Analyzing Quantitative
Data
- Maryam Ramezani
- Values in Computer Technology
- CSC 426
2Outline
3Role of Statistics in Research
- With statistics, we can summarize large bodies
of data, make predictions about future trends,
and determine when different experimental
treatments have led to significantly different
outcomes.
- Statistics are among the most powerful tools in
the researcher's toolbox.
4How do statistics come into research?
- In quantitative research, we use numbers to
represent physical or nonphysical phenomena.
- We use statistics to summarize and interpret
those numbers.
5Exploring and Organizing a Data Set
- Look at your data and find ways of organizing
them.
- Example: test scores for 11 children
- What do you see?
Ruth 96, Robert 60, Chuck 68, Margaret 88, Tom
56, Mary 92, Ralph 64, Bill 72, Alice 80, Adam
76, Kathy 84
6Exploring and Organizing a Data Set
Alphabetical Order
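As a minimal sketch of organizing the 11 test scores from the previous slide, the snippet below sorts them alphabetically by name and then by score. Python is used here only for illustration; the variable names are assumptions, not part of the original slides.

```python
# Test scores for the 11 children from the example.
scores = {
    "Ruth": 96, "Robert": 60, "Chuck": 68, "Margaret": 88,
    "Tom": 56, "Mary": 92, "Ralph": 64, "Bill": 72,
    "Alice": 80, "Adam": 76, "Kathy": 84,
}

# Alphabetical order by name.
by_name = sorted(scores.items())

# Descending order by score.
by_score = sorted(scores.items(), key=lambda item: item[1], reverse=True)

for name, score in by_score:
    print(f"{name:<10}{score}")
```

Sorting by score shows a pattern that the original listing hides: the scores are evenly spaced, 4 points apart, from 56 to 96.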
7Using Computer Spreadsheets to Organize and
Analyze Data
- Sorting
- Graphing
- Formulas
- What Ifs
- Save, Store, recall, update information
8Functions of Statistics
- Descriptive Statistics
- Describe what the data look like
- Inferential Statistics
- Draw inferences about a large population from
small samples
9Considering the Nature of the Data
- Continuous or discrete
- Nominal, ordinal, interval or ratio scale
- Normal or non-normal distribution
10Continuous versus Discrete Variables
- Continuous data can take on any value within a
finite or infinite interval. You can count, order,
and measure continuous data.
- Examples: height, weight, temperature, the amount
of sugar in an orange, the time required to run a
mile.
- Discrete data: the values or observations are
distinct and separate, i.e., they can be counted
(1, 2, 3, ...).
- Examples: the number of kittens in a litter, the
number of patients in a doctor's surgery, the
number of flaws in one metre of cloth, gender
(male, female), blood group (O, A, B, AB).
11Nominal Data
- The numbers are simply labels. You can count, but
not order or measure, nominal data.
- Example: males could be coded as 0 and females as 1;
the marital status of an individual could be coded as
Y if married, N if single.
- Classification data, e.g., m/f
- No ordering, e.g., it makes no sense to state that
M > F
- Arbitrary labels, e.g., m/f, 0/1, etc.
12Ordinal Data
- Ordered, but the differences between values are not
meaningful
- e.g., Likert scales: rank your degree of
satisfaction on a scale of 1 to 5
- The difference in enjoyment expressed by giving a
rating of 2 rather than 1 might be much less than
the difference in enjoyment expressed by giving a
rating of 4 rather than 3.
- You can count and order, but not measure, ordinal
data.
13Interval Data
- ordered, constant scale, but no natural zero
- differences make sense, but ratios do not
- e.g., temperature: 30° − 20° = 20° − 10°, but 20° is
not twice as hot as 10°!
- e.g., dates: the time interval between the starts
of years 1981 and 1982 is the same as that
between 1983 and 1984, namely 365 days. The zero
point, year 1 AD, is arbitrary; time did not
begin then.
14Ratio Data
- Like interval data but has true zero
- Ordered, Constant scale, natural zero
- e.g., height, weight, age, length
15Normal and Non-Normal Distributions
16Normal Distribution
17Non-Normal Distributions
Skewed to the Left (Negatively Skewed)
Skewed to the Right (Positively Skewed)
18Leptokurtic and Platykurtic Distributions
19Descriptive Statistics
- Descriptive Statistics describes data
- Points of Central Tendency
- Amount of Variability
- Relation of different variables to each other
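20Points of Central Tendency: Mean
- Measuring center: if the n observations are
$x_1, x_2, \ldots, x_n$, the arithmetic mean is
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
- Geometric mean:
$$\bar{x}_g = \sqrt[n]{x_1 x_2 \cdots x_n}$$
e.g., biological growth, population growth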
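The difference between the two means is easiest to see on growth data. The sketch below uses Python's standard `statistics` module (Python 3.8 or later); the growth factors are hypothetical values chosen only for illustration.

```python
import math
import statistics

# Hypothetical yearly growth factors (1.10 means +10% in that year).
growth = [1.10, 1.20, 0.90, 1.05]

arithmetic = statistics.mean(growth)
geometric = statistics.geometric_mean(growth)

# The geometric mean is the constant yearly factor that would produce
# the same overall growth as the observed sequence of factors.
total_growth = math.prod(growth)

print(arithmetic, geometric)
print(total_growth, geometric ** len(growth))
```

The last line prints two values that agree up to rounding, which is why the geometric mean is the natural average for growth rates. For positive data the geometric mean is never larger than the arithmetic mean.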
21Measure of Central Tendency
22Measures of Variability
How great is the spread?
- Range = highest score − lowest score
- The quartiles: the pth percentile of a
distribution is the value such that p percent of
the observations fall at or below it.
- The 50th percentile = median, M
- The 25th percentile = first quartile, Q1
- The 75th percentile = third quartile, Q3
- Interquartile range = Q3 − Q1
- Example:
13 13 16 19 21 21 23 23 24 26 26 27 27
27 28 28 30 30
- M = ?, Q1 = ?, Q3 = ?
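One way to work the example is sketched below. It assumes the "median of each half" convention for quartiles, which is common in introductory texts; other conventions (including the defaults of some software packages) can give slightly different quartile values. The helper name `quartiles` is an illustrative choice.

```python
import statistics

data = [13, 13, 16, 19, 21, 21, 23, 23, 24,
        26, 26, 27, 27, 27, 28, 28, 30, 30]

def quartiles(values):
    """Return (Q1, M, Q3) using the median-of-halves convention."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]    # observations below the median position
    upper = ordered[-half:]   # observations above the median position
    # For an odd count, the middle value is left out of both halves.
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))

q1, m, q3 = quartiles(data)
iqr = q3 - q1
data_range = max(data) - min(data)
print(m, q1, q3, iqr, data_range)
```

Under this convention the 18 observations give M = 25, Q1 = 21, Q3 = 27, an interquartile range of 6, and a range of 17.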
23Measures of Variability
Standard Deviation
Standardized Score
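A minimal sketch of both quantities, using the 11 test scores from the earlier example. It assumes the sample standard deviation (dividing by n − 1) and the usual standardized score z = (x − mean) / sd; the slides do not say which variant of the standard deviation is intended.

```python
import statistics

# Test scores of the 11 children from the earlier example.
scores = [96, 60, 68, 88, 56, 92, 64, 72, 80, 76, 84]

mean = statistics.mean(scores)
sd = statistics.stdev(scores)  # sample SD: divides by n - 1 (assumption)

def z_score(x, mean, sd):
    """Number of standard deviations that x lies above the mean."""
    return (x - mean) / sd

z_ruth = z_score(96, mean, sd)  # Ruth's score of 96
print(mean, round(sd, 2), round(z_ruth, 2))
```

The mean is 76 and the sample variance is 176, so the standard deviation is about 13.27. Ruth's score of 96 is therefore about 1.5 standard deviations above the mean.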
24Measure of Relationship Correlation
- Correlation indicates the strength and direction
of a linear relationship between two variables.
- See page 266 for other examples of correlation
statistics.
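As an illustration, the sketch below computes the Pearson product-moment correlation coefficient r, one common correlation statistic. The function name and the data (hours studied versus marks) are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

hours = [1, 2, 3, 4, 5]        # hypothetical hours studied
marks = [52, 55, 61, 64, 70]   # hypothetical marks obtained

r = pearson_r(hours, marks)
print(round(r, 3))
```

The sign of r gives the direction of the relationship and its magnitude (between 0 and 1) gives the strength. A perfectly linear increasing relationship gives r = 1, and a perfectly linear decreasing one gives r = −1.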
25Notes about Correlation
- Finding a substantial correlation between two
characteristics requires reasonable validity and
reliability in measuring them.
- Correlation does not indicate causation.
26Examples of using Statistics in Computer Science
- Conceptual Representation of User Transactions or
Sessions
Pageview/objects
Session/user data
27Inferential Statistics
- We use sample statistics as estimates of population
parameters.
- The quality of all statistical analysis depends
on the quality of the sample data.
- Random sampling: every unit in the population
has an equal chance of being chosen. A random sample
should represent the population well, so sample
statistics from a random sample should provide
reasonable estimates of population parameters.
28Some definitions
- A parameter describes a population.
- A statistic describes a sample.
A parameter is a characteristic or quality of a
population that is constant in concept; however,
its value is variable. Example: the radius is a
parameter of a circle.
29Inferential Statistics
- Estimate a population parameter from a random
sample
- Test statistical hypotheses
30Inferential Statistics: Estimate a Population
Parameter from a Sample
- All sample statistics have some error in
estimating population parameters.
- Example: estimate the mean height of 10-year-old
boys in Chicago; sample = 200 boys.
- How close is the sample mean to the population
mean?
- We don't know, but we do know that:
- The means of an infinite number of samples form
a normal distribution.
- The population mean equals the average (mean) of
all sample means.
- The standard deviation of this sampling distribution
(the standard error) is directly related to the
standard deviation of the characteristic in question
in the overall population.
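These three properties can be checked with a small simulation. The sketch below is an illustration under stated assumptions: the population of heights is invented (normal, mean 140 cm, standard deviation 7 cm), and 1,000 samples of size 50 stand in for "an infinite number of samples".

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population of heights in cm (the numbers are made up).
population = [random.gauss(140, 7) for _ in range(20_000)]
pop_mean = statistics.mean(population)
pop_sd = statistics.pstdev(population)

n = 50  # size of each sample
sample_means = [statistics.mean(random.sample(population, n))
                for _ in range(1_000)]

mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)

# The average of the sample means is close to the population mean.
print(pop_mean, mean_of_means)
# The spread of the sample means is close to pop_sd / sqrt(n).
print(pop_sd / math.sqrt(n), sd_of_means)
```

The second comparison is the sense in which the standard error is "directly related" to the population standard deviation: it is that standard deviation divided by the square root of the sample size.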
31Standard Error
- The standard error tells us how much a particular
mean varies from one sample to another when all
samples are the same size and drawn randomly from
the same population.
- Standard error:
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$
- n is the size of each sample and σ is the population
standard deviation, which we don't have!
- We use the standard deviation of the sample, s, instead.
32Accuracy of the Estimator
As in many problems, there is a trade-off between
accuracy and dollars.
What will we get for our money if we
invest dollars in obtaining a larger sample size?
n = 100? n = 200?
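The question can be made concrete with the usual standard error formula, standard deviation divided by the square root of n. In the sketch below the standard deviation of 10 is a hypothetical value; only the ratio between the two results matters.

```python
import math

def standard_error(sd, n):
    """Standard error of the mean for a sample of size n."""
    return sd / math.sqrt(n)

sd = 10.0  # hypothetical standard deviation

se_100 = standard_error(sd, 100)
se_200 = standard_error(sd, 200)
ratio = se_100 / se_200

print(se_100, round(se_200, 3), round(ratio, 3))
```

Doubling the sample size from 100 to 200 shrinks the standard error only by a factor of √2 (about 1.41), not by a factor of 2. Halving the standard error requires four times as many observations, which is where the trade-off with cost comes from.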
33Point versus Interval Estimate
- A point estimate is a single value--a
point--taken from a sample and used to estimate
the corresponding parameter of a population.
- x̄, s, s², and r estimate µ, σ, σ², and ρ,
respectively.
- An interval estimate is a range of values--an
interval within whose limits a population
parameter probably lies.
- We say that we are 95% confident that the unknown
population mean lies in the interval.
- 95% confidence interval for µ:
$$\left(\bar{x} - \frac{2\sigma}{\sqrt{n}},\ \bar{x} + \frac{2\sigma}{\sqrt{n}}\right)$$
- In only 5% of all samples does the above interval
fail to contain the population mean µ;
- that is, 5% of all samples give inaccurate results.
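A minimal sketch of the interval estimate, using the multiplier of 2 from the slide (the more precise value for a 95% interval under a normal distribution is about 1.96). The function name and the numbers in the usage line are hypothetical.

```python
import math

def confidence_interval(sample_mean, sd, n, multiplier=2.0):
    """Approximate 95% confidence interval for the population mean."""
    half_width = multiplier * sd / math.sqrt(n)
    return sample_mean - half_width, sample_mean + half_width

# Hypothetical sample: mean 100, standard deviation 10, n = 100.
low, high = confidence_interval(100.0, 10.0, 100)
print(low, high)
```

With these numbers the standard error is 1, so the interval runs from 98 to 102. In practice the population standard deviation is unknown and the sample standard deviation is used in its place.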
34Testing Hypothesis
- Confidence intervals are used when the goal of
our analysis is to estimate an unknown parameter
of the population.
- A second goal of a statistical analysis is to
verify some claim about the population on the
basis of the data.
- Research hypothesis / statistical hypothesis
- A test of significance is a procedure to assess
the truth of a hypothesis using the observed
data. The results of the test are expressed in
terms of a probability that measures how well the
data support the hypothesis.
35Example: To determine whether the mean nicotine
content of a brand of cigarettes is greater than
the advertised value of 1.4 milligrams, a health
advocacy group takes a sample of 500 cigarettes
and measures the amount of nicotine in the
sample.
Sample values: the sample average of nicotine is
1.51 mg; the standard deviation is 1.016.
The estimated amount of nicotine is 1.51 mg,
based on the sample values. The standard error
of the sample average is
$$\text{S.E.} = \frac{\text{s.d.}}{\sqrt{n-1}} = \frac{1.016}{\sqrt{499}} \approx 0.045$$
Is there an actual difference between the
sample value (1.51 mg) and the advertised value
(1.4 mg)? Or is it just due to sampling
error? To answer this question we need a test of
significance.
36Stating Hypotheses
The null hypothesis H0 expresses the idea that
the observed difference is due to chance. It is a
statement of no effect or no difference,
and is expressed in terms of the population
parameter.
Let µ denote the true average amount of
nicotine. H0: µ = 1.4 mg
The alternative hypothesis Ha represents the idea
that the difference is real. It is expressed as
the statement we hope or suspect is true instead
of the null hypothesis.
The alternative hypothesis states that the
cigarettes contain a higher amount of nicotine,
that is, Ha: µ > 1.4 mg
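The nicotine example can be carried through as a one-sided test of the null hypothesis (true mean equals 1.4 mg) against the alternative (true mean is greater than 1.4 mg). The sketch below assumes a z-test based on the normal distribution, which is reasonable for a sample of 500; the slides do not name the specific test, so this choice is an assumption.

```python
import math
from statistics import NormalDist

n = 500            # cigarettes sampled
sample_mean = 1.51 # mg
sample_sd = 1.016
mu_0 = 1.4         # advertised value under H0, in mg

# Standard error as computed in the example (dividing by sqrt(n - 1)).
standard_error = sample_sd / math.sqrt(n - 1)

# How many standard errors the sample mean lies above the advertised value.
z = (sample_mean - mu_0) / standard_error

# One-sided p-value: chance of a z this large or larger if H0 were true.
p_value = 1 - NormalDist().cdf(z)

print(round(standard_error, 3), round(z, 2), round(p_value, 4))
```

The sample mean lies about 2.4 standard errors above the advertised value, and the one-sided p-value is below 0.01. A difference this large would be unlikely if it were due to sampling error alone, so the data favor Ha.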
37General comments on stating hypotheses
- It is not easy to state the null and the
alternative hypotheses!
- The hypotheses are statements about the population
values.
- The alternative hypothesis Ha is often called the
research hypothesis, because it is the
hypothesis we are interested in.
- A significance test is a test against the null
hypothesis.
- Often we set Ha first, and then H0 is defined as
the opposite statement!
38Errors in Hypothesis testing
- Type I error: the null hypothesis is rejected
when it is in fact true; that is, H0 is wrongly
rejected.
- Type II error: the null hypothesis H0 is not
rejected when it is in fact false.
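Both kinds of error can be observed in a simulation. The sketch below is illustrative only: it assumes a one-sided z-test at the 5% significance level with a known standard deviation of 1, samples of size 30, and a true mean of 0.3 for the case where H0 is false. All of these settings are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean

random.seed(1)  # fixed seed so the sketch is repeatable

ALPHA = 0.05
CRITICAL_Z = NormalDist().inv_cdf(1 - ALPHA)  # about 1.645, one-sided

def rejects_h0(true_mean, n=30, sigma=1.0, mu_0=0.0):
    """Draw one sample and run a one-sided z-test of H0: mu = mu_0."""
    sample = [random.gauss(true_mean, sigma) for _ in range(n)]
    z = (mean(sample) - mu_0) / (sigma / math.sqrt(n))
    return z > CRITICAL_Z

TRIALS = 2_000

# H0 is true (true mean equals mu_0): every rejection is a Type I error.
type_1_rate = sum(rejects_h0(0.0) for _ in range(TRIALS)) / TRIALS

# H0 is false (true mean is 0.3): every non-rejection is a Type II error.
type_2_rate = sum(not rejects_h0(0.3) for _ in range(TRIALS)) / TRIALS

print(type_1_rate, type_2_rate)
```

The Type I error rate stays near the chosen significance level of 5%. The Type II error rate depends on how far the truth is from H0 and on the sample size; with these settings it is roughly one half.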
39Meta-Analysis
- "Meta-analysis refers to the analysis of
analyses...the statistical analysis of a large
collection of analysis results from individual
studies for the purpose of integrating the
findings." (Glass, 1976, p. 3)
- Conduct a fairly extensive search for relevant
studies
- Identify appropriate studies to include in the
meta-analysis
- Convert each study's results to a common
statistical index
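The slides do not say which common index is meant. One frequently used choice is a standardized mean difference (an effect size), which puts studies measured on different scales onto the same footing. The sketch below assumes that choice, and combines studies with a simple sample-size-weighted average, which is a simplification of the weighting used in formal meta-analysis. All study numbers are invented.

```python
# Hypothetical study summaries; none of these numbers come from real studies.
studies = [
    {"mean_treatment": 82.0, "mean_control": 78.0, "pooled_sd": 10.0, "n": 40},
    {"mean_treatment": 5.6, "mean_control": 5.0, "pooled_sd": 1.5, "n": 120},
    {"mean_treatment": 310.0, "mean_control": 300.0, "pooled_sd": 50.0, "n": 60},
]

def effect_size(study):
    """Standardized mean difference: group difference in SD units."""
    diff = study["mean_treatment"] - study["mean_control"]
    return diff / study["pooled_sd"]

# Step 1: convert each study's result to the common index.
effects = [effect_size(s) for s in studies]

# Step 2: integrate the findings, weighting each study by its sample size.
total_n = sum(s["n"] for s in studies)
combined = sum(d * s["n"] for d, s in zip(effects, studies)) / total_n

print([round(d, 2) for d in effects], round(combined, 3))
```

Although the three studies report results on very different scales, their effect sizes (about 0.4, 0.4, and 0.2) are directly comparable and can be averaged.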
40Using Statistical Software Packages
- SPSS
- SAS
- Matlab Statistics toolbox
- SYSTAT, Minitab, StatView, Statistica
41Interpreting the Data
- Relating the findings to the original research
problem and to the specific research questions
and hypotheses
- Relating the findings to preexisting literature,
concepts, theories, and research results
- Determining whether the findings have practical
significance as well as statistical significance
- Identifying limitations of the study