Title: Chapter 6. Descriptive Statistics
- 6.1 Experimentation
- 6.2 Data Presentation
- 6.3 Sample Statistics
- 6.4 Examples
- Data: a mixture of nature and noise.
- - Is the noise manageable?
- - Ideally, the noise can be represented by a probability distribution.
- Statistical inference
- - The science of deducing properties of an underlying probability distribution from data.
- - Can we obtain information about the underlying probability distribution?
- - The information is given in the form of (functions of) data.
Figure 6.1 The relationship between probability theory and statistical inference
6.1 Experimentation
6.1.1 Samples
- Population: the set of all possible observations available from a particular probability distribution.
- Sample: a subset of a population.
- Random sample: a sample whose elements are chosen at random from the population.
- A sample should be representative of the population.
- Types of observations: numerical and categorical (nominal)
- - Numerical observation: either integers or real numbers.
- - Categorical observation: e.g., a machine breakdown is classified as mechanical, electrical, or misuse.
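As a minimal sketch of drawing a random sample, assume a small finite population of numerical observations. The population values, the sample size, and the seed below are made up for illustration; `random.Random.sample` from the Python standard library chooses elements at random without replacement.

```python
import random

# Hypothetical finite population of numerical observations (illustration only).
population = list(range(1, 101))

def draw_random_sample(population, n, seed=None):
    """Return n elements chosen at random from the population, without replacement."""
    rng = random.Random(seed)  # a seed only makes the draw reproducible
    return rng.sample(population, n)

sample = draw_random_sample(population, 10, seed=0)
print(sample)
```

Every element of the population has the same chance of being chosen, which is what makes the sample a reasonable candidate for being representative.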
6.1.2 Examples
- Example 1: Machine breakdowns
- - Suppose that an engineer in charge of the maintenance of a machine keeps records of the breakdown causes over a period of a year.
- - Suppose that the engineer observed 46 breakdowns (see Figure 6.2).
- - What is the population from which this sample is drawn?
- Factors to consider when checking whether the data are representative
- - Quality of operators
- - Working load on the machine
- - Peculiarities of the observation period (e.g., more rainy days than in other years)
Figure 6.2 Data set of machine breakdowns
How representative is this year's data set of future years?
Figure 6.4 Data set of defective computer chips
- Example 2: Defective computer chips
- - The chip boxes are selected at random.
- Points to check on data
- - What is the data type?
- - Are the data representative?
- - How is the randomness of the data realized?
- Statistical problem
- - What is the population from which the data are sampled?
6.2 Data Presentation
- 6.2.1 Bar and Pareto charts
- 6.2.2 Pie charts
- 6.2.3 Histograms
- 6.2.4 Outliers
Figure 6.7 Bar chart of machine breakdowns data set
- A bar chart uses bars to represent the data.
- The length of a bar is proportional to the frequency of the corresponding category.
Figure 6.9 Pareto chart of customer complaints for Internet company
- A Pareto chart is a bar chart where the categories are arranged in order of decreasing frequency.
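The ordering behind a Pareto chart can be sketched in a few lines. The complaint categories and counts below are made up for illustration and are not the data of Figure 6.9; `Counter.most_common` returns categories by decreasing frequency, and a text bar stands in for a drawn bar.

```python
from collections import Counter

# Hypothetical categorical observations (made up for illustration).
complaints = ["billing", "speed", "outage", "speed", "billing",
              "speed", "support", "outage", "speed", "billing"]

def pareto_rows(observations):
    """Return (category, frequency) pairs, most frequent category first."""
    return Counter(observations).most_common()

rows = pareto_rows(complaints)
for category, freq in rows:
    # The length of each text bar is proportional to the frequency.
    print(f"{category:<8} {'#' * freq} ({freq})")
```

Printing the rows in the original category order instead would give an ordinary bar chart; only the sorting step distinguishes the two.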
Figure 6.12 Pie chart for machine breakdowns data set
- Pie charts emphasize the proportion of each category.
Figure 6.14 Histogram of computer chips data set
- Histograms are used to represent numerical data.
Figure 6.16 Histograms of metal cylinder diameter data set with different bandwidths
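To see how the bandwidth (bin width) changes a histogram, the sketch below counts the same observations with two different widths. The data values, the starting point, and the widths are assumptions chosen for illustration, not the metal cylinder data of Figure 6.16.

```python
import math

# Hypothetical numerical observations (made up for illustration).
values = [12, 15, 21, 22, 25, 27, 31, 33, 38, 44]

def histogram_counts(data, start, width, n_bins):
    """Count observations in bins [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_bins
    for x in data:
        i = math.floor((x - start) / width)
        if 0 <= i < n_bins:  # observations outside the range are ignored
            counts[i] += 1
    return counts

wide = histogram_counts(values, start=10, width=10, n_bins=4)
narrow = histogram_counts(values, start=10, width=5, n_bins=8)
print(wide)    # [2, 4, 3, 1]
print(narrow)  # [1, 1, 2, 2, 2, 1, 1, 0]
```

The wide bins show a single peak, while the narrow bins spread the same ten observations almost evenly, so the choice of bandwidth affects the shape a reader perceives.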
Figure 6.18 A histogram with positive (or right) skewness
- The right-hand tail is longer and flatter than the left-hand tail.
Figure 6.19 A histogram with negative (or left) skewness
Figure 6.20 A histogram for a bimodal distribution
Figure 6.21 Histogram of a data set with a possible outlier
- An outlier is an observation that is not from the distribution from which the main body of the sample is collected.
6.3 Sample Statistics
- 6.3.1 Sample mean of a data set $x_1, \ldots, x_n$
- - $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
- 6.3.2 Sample median
- - The value in the middle of the ordered data points $x_{(1)} \le \cdots \le x_{(n)}$
- - If $n = 2k + 1$ (odd), the sample median is $x_{(k+1)}$.
- - If $n = 2k$ (even), the sample median is $(x_{(k)} + x_{(k+1)})/2$.
- 6.3.3 Sample trimmed mean
- - A trimmed mean is obtained by deleting some of the largest and some of the smallest data observations.
- - Usually a 10% trimmed mean is employed, where the top 10% of the data points are removed together with the bottom 10% of the data points.
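The three location statistics above can be sketched as follows. The function names and the data set are made up for illustration, and rounding down the number of points trimmed from each tail is an assumption, since the slides do not say how to handle a count that is not a whole number.

```python
def sample_mean(data):
    """Arithmetic average of the observations."""
    return sum(data) / len(data)

def sample_median(data):
    """Middle value of the ordered data (average of the two middle values if n is even)."""
    xs = sorted(data)
    n = len(xs)
    k = n // 2
    if n % 2 == 1:                   # n = 2k + 1: the (k+1)-th ordered value
        return xs[k]
    return (xs[k - 1] + xs[k]) / 2   # n = 2k: average of the k-th and (k+1)-th

def trimmed_mean(data, percent=10):
    """Mean after removing `percent` % of the points from each tail (rounded down)."""
    xs = sorted(data)
    m = len(xs) * percent // 100     # assumption: round the trimmed count down
    kept = xs[m:len(xs) - m]
    return sum(kept) / len(kept)

# Hypothetical data with one very large observation (illustration only).
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(sample_mean(data))    # 14.5
print(sample_median(data))  # 5.5
print(trimmed_mean(data))   # 5.5
```

The single large observation pulls the sample mean well above the median, while the 10% trimmed mean discards it together with the smallest point.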
Figure 6.22 Illustrative data set
Figure 6.23 Relationship between the sample mean, median, and trimmed mean for positively and negatively skewed data sets
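A small numerical check can illustrate the kind of relationship such a figure describes. The data below are made up, and the pattern shown (the mean lying on the long-tail side of the median) is typical for skewed data rather than a guaranteed rule. The sketch uses `statistics.mean` and `statistics.median` from the Python standard library.

```python
import statistics

# Hypothetical right-skewed data (illustration only) and its mirror image.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 5, 9, 18]
left_skewed = [20 - x for x in right_skewed]

def mean_minus_median(data):
    """Positive when the mean lies above the median, negative when below."""
    return statistics.mean(data) - statistics.median(data)

print(mean_minus_median(right_skewed))  # 2.0
print(mean_minus_median(left_skewed))   # -2.0
```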
- 6.3.4 Sample mode
- - For categorical or discrete data, the sample mode may be used to denote the category or data value that contains the largest number of data observations.
- 6.3.5 Sample variance ($s^2$)
- - $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$, where $\bar{x}$ is the sample mean
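A minimal sketch of both statistics follows. The function names and data are made up for illustration, and the sample variance is assumed to use the usual n - 1 divisor.

```python
from collections import Counter

def sample_mode(data):
    """Category or value with the largest number of observations.

    If several values tie, the one encountered first is returned.
    """
    return Counter(data).most_common(1)[0][0]

def sample_variance(data):
    """Sample variance s^2, assuming the n - 1 divisor."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)

# Hypothetical observations (illustration only).
causes = ["mechanical", "electrical", "mechanical", "misuse", "mechanical"]
print(sample_mode(causes))               # mechanical
print(sample_variance([1, 2, 3, 4, 5]))  # 2.5
```

The sample standard deviation s is the square root of this value and is in the same units as the data.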
- 6.3.6 The 100p-th sample percentile (p-th sample quantile) and sample quartiles
- - The 100p-th sample percentile is the value such that at least 100p percent of the data are less than or equal to it and at least 100(1-p) percent are greater than or equal to it. If two values satisfy the condition, the 100p-th sample percentile is the arithmetic average of these two values.
- Example: the data set 2, 5, 6, 7, 8
- - p = 0.1: the 10th sample percentile is 2.
- - p = 0.2: the 20th sample percentile is (2 + 5)/2 = 3.5.
- The 25th sample percentile is the first quartile.
- The 50th sample percentile is the second quartile, or the sample median.
- The 75th sample percentile is the third quartile.
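The percentile definition above can be checked directly against the data set 2, 5, 6, 7, 8. The sketch below is one possible reading of the definition: it tests each data value against the two conditions and averages the smallest and largest values that qualify. The function name and the small tolerance guarding against floating-point rounding are assumptions for illustration.

```python
def sample_percentile(data, p, tol=1e-9):
    """100p-th sample percentile under the 'at least 100p % at or below' definition."""
    xs = sorted(data)
    n = len(xs)
    candidates = []
    for v in xs:
        at_or_below = sum(1 for x in xs if x <= v)
        at_or_above = sum(1 for x in xs if x >= v)
        # tol only protects the comparisons from rounding in p * n
        if at_or_below >= p * n - tol and at_or_above >= (1 - p) * n - tol:
            candidates.append(v)
    # One qualifying value: return it. Two: return their arithmetic average.
    return (min(candidates) + max(candidates)) / 2

data = [2, 5, 6, 7, 8]
print(sample_percentile(data, 0.1))  # 2.0
print(sample_percentile(data, 0.2))  # 3.5
print(sample_percentile(data, 0.5))  # 6.0, the sample median
```

Statistical libraries often use interpolation rules that differ from this definition, so their quartiles may not match these values exactly.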
Figure 6.24 Boxplot of a data set
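- 6.3.8 Coefficient of variation (CV)
- - $CV = s / \bar{x}$, where $s$ is the sample standard deviation and $\bar{x}$ is the sample mean
- - Measures the spread of the data relative to the middle value.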
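As a hedged sketch, the coefficient of variation is taken here to be the sample standard deviation divided by the sample mean, which is the usual definition and is meaningful for positive data. The two data sets are made up to have the same spread but different middle values; `statistics.stdev` uses the n - 1 divisor.

```python
import statistics

def coefficient_of_variation(data):
    """Sample standard deviation divided by the sample mean (assumed definition)."""
    return statistics.stdev(data) / statistics.mean(data)

# Hypothetical data sets with equal spread (s = 1) but different means.
small_values = [9, 10, 11]
large_values = [99, 100, 101]
print(coefficient_of_variation(small_values))
print(coefficient_of_variation(large_values))
```

Both data sets have the same standard deviation, but the spread is ten times larger relative to the middle value in the first one, which is what the CV captures.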
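- Recall Chebyshev's inequality
- - Let $X$ be a random variable with mean $\mu$ and variance $\sigma^2$.
- - Then, for any $k > 0$,
- - $P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}$
- - In general, for any $c > 0$, $P(|X - \mu| \ge c) \le \frac{\sigma^2}{c^2}$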
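- Theorem: the weak law of large numbers
- - Let $X_1, X_2, \ldots$ be a sequence of i.i.d. random variables, each having mean $\mu$ and variance $\sigma^2$, and let $\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$.
- - Then, for any $\varepsilon > 0$,
- - $P(|\bar{X}_n - \mu| \ge \varepsilon) \to 0$ as $n \to \infty$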
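The step from Chebyshev's inequality to the weak law of large numbers can be sketched as follows. The notation is an assumption where the slide does not show it: $\bar{X}_n$ denotes the average of the first $n$ variables, and Chebyshev's inequality is used in the standard form $P(|Y - E[Y]| \ge \varepsilon) \le \operatorname{Var}(Y)/\varepsilon^2$, applied with $Y = \bar{X}_n$.

```latex
\begin{align*}
E[\bar{X}_n] &= \mu, \qquad
\operatorname{Var}(\bar{X}_n) = \frac{1}{n^2} \sum_{i=1}^{n} \operatorname{Var}(X_i) = \frac{\sigma^2}{n}, \\
P\left(|\bar{X}_n - \mu| \ge \varepsilon\right)
  &\le \frac{\operatorname{Var}(\bar{X}_n)}{\varepsilon^2}
   = \frac{\sigma^2}{n \varepsilon^2}
   \longrightarrow 0 \quad (n \to \infty).
\end{align*}
```

The variance of the average shrinks like $1/n$ because the variables are independent, which is the only place independence is used.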
- Homework 6
- - Read Chapter 6.
- - Review Chapters 1 to 5.
- Midterm Exam
- - Date: 10.24 (Wed)
- - Time: 9:00-11:00