Title: CHAPTER 14: INTRODUCTION TO DATA ANALYSIS
14.1 INTRODUCTION
- There are many situations in business where data is collected and analysed.
- The key ideas of data analysis are important in the modern business environment.
- Data analysis aims to summarise and understand the main features of the variables contained within the data, and to investigate the nature of any linkages that may exist between the variables.
14.2 WHAT IS DATA
- Example 1
- Population: the set of all people/objects of interest in the study being undertaken.
- Very large
- Enumerated precisely
- Cannot be enumerated physically
Population member
- The information recorded for each member of the population:
- Age
- Gender
- Parish
- Will you vote in the by-election?
- Will you vote for me?
- Variable: one piece of information recorded for each member of the population
- This example has five variables
- To investigate the connection between the following pairs of variables:
- 'Will you vote for me' and 'Age'
- 'Will you vote for me' and 'Gender'
- 'Will you vote for me' and 'Parish'
- When population data is used, the outcomes of the analysis are precise: 'perfect information' results.
- Population: the set of all customers
- A sensible initial set of questions is:
- Do you understand exactly what each variable is measuring/recording?
- Do you understand the problem under investigation, and are the objectives of the investigation clear?
14.3 DESCRIBING VARIABLES
- Classification of variable types
- Attribute variables
- Measured variables
- Attribute Variables
- An attribute variable has its outcomes described in terms of its characteristics or attributes.
- Example 1 'By-Election Data'
- Example 2 'Credit Data'
- Does the customer own their own house?
- 0 = Yes, 1 = No
- The Region in which the customer is resident?
- 1 = South West
- 2 = South East
- 3 = London
- 4 = Midlands
- 5 = North
- A common way of handling attribute data is to give each attribute a numerical code: 0, 1, 2, ...
- Measured Variable
- A measured variable is a variable whose outcomes are measured; the resulting outcome is expressed in numerical terms.
- Two types of measured variable
- Continuous variable: measured on a continuous scale (e.g. a person's weight)
- Discrete variable: takes separate, countable values (e.g. the number of passengers on a flight)
- Example 1 'By-Election Data'
- The measured variable in this data set is 'Age'
- Example 2 'Credit Data'
- Measured variables are as follows: Customer's Age, Household Income, Estimated monthly outgoings, The Amount borrowed on credit
14.4 THE CONCEPT OF A STATISTICAL DISTRIBUTION
- Attribute Variable
- Gender of constituents (Example 1)
DISTRIBUTION OF GENDER IN THE CONSTITUENCY
DISTRIBUTION OF REGION IN WHICH CUSTOMER IS RESIDENT
- Measured Variable
- Customer's Age (Example 2)
DISTRIBUTION OF AGE OF CUSTOMER
- Household Income (Example 2)
DISTRIBUTION OF HOUSEHOLD INCOME
- What does the distribution show?
- The area under the curve from one income value to another measures the relative proportion of the population having household incomes in that range.
- A household income lower than 10,000 is relatively rare
- A large proportion of the population have household incomes between 20,000 and 50,000
- The Descriptive Statistics for the Distribution of a Measured Variable
- Distribution of the height of adults in Great Britain.
- The height of children under 11 years of age
(Figure: children's heights compared with adults' heights)
- Heights in two different countries, country A and country B
DISTRIBUTION OF HEIGHTS: COUNTRY A AND COUNTRY B
- A statistical distribution for a measured variable can be summarised using three key descriptions:
- Centre of the distribution
- Width of the distribution
- Symmetry of the distribution
- Measuring the Centre of a Distribution
- The Mean
- Average value $= \sum X / n$
- Example: average Household Income
- Symbol for the population mean: $\mu$
- Formally, the population mean of a variable is defined to be
- $\mu = \sum X / n$
- The Median
- The median value of the variable is defined to be the particular value of the variable such that half the data values are less than the median value and half are greater.
- After sorting all the data in ascending order, the median is the middle value in the list (with an even number of values, the average of the two middle values).
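As a minimal sketch of the two measures of centre, the snippet below computes the mean as the sum of the values divided by n, and the median as the middle value of the sorted data. The income figures are made-up illustrative values, not data from the chapter, and the helper name `median` is hypothetical.

```python
import statistics

# Hypothetical household incomes (illustrative values only, not chapter data).
incomes = [12000, 18000, 22000, 25000, 30000, 34000, 41000, 48000, 90000]

# Mean: sum of the values divided by the number of values.
mean_income = sum(incomes) / len(incomes)


def median(values):
    """Middle value of the sorted data (average of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


median_income = median(incomes)

print(f"mean = {mean_income:.2f}, median = {median_income}")
```

In this made-up data the single large income pulls the mean above the median, which is why the two measures are usually reported together.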
- Measuring the Width of a Distribution
- The Standard Deviation
- The standard deviation is the square root of the average squared deviation from the mean.
- Symbol for the population standard deviation: $\sigma$
- $\sigma$ is usually defined in terms of the variance $\sigma^2$ as
- $\sigma^2 = \sum (X - \mu)^2 / n$
- The standard deviation is the square root of the variance: $\sigma = \sqrt{\sigma^2}$
- Calculating the standard deviation for the variable 'Household Income'
- The standard deviation is a relative measure of spread (width): the larger the standard deviation, the wider the distribution.
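The population variance and standard deviation defined above can be sketched in a few lines of Python. The data values are illustrative only (chosen so the arithmetic comes out evenly), and the variable names are assumptions made for this example.

```python
import math
import statistics

# Illustrative population values (not chapter data).
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Population mean: sum of X divided by n.
mu = sum(values) / len(values)

# Population variance: average squared deviation from the mean.
variance = sum((x - mu) ** 2 for x in values) / len(values)

# Population standard deviation: square root of the variance.
sigma = math.sqrt(variance)

print(f"mean = {mu}, variance = {variance}, standard deviation = {sigma}")
```

The standard library function `statistics.pstdev` performs the same population calculation and can be used as a cross-check.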
- Inter-quartile Range
- The inter-quartile range is the range over which the middle 50% of the data values vary.
- To define the quartiles:
- Q1: the value of the variable that divides the distribution 25% to the left and 75% to the right.
- Q2: the value of the variable that divides the distribution 50% to the left and 50% to the right (Q2 is the median).
- Q3: the value of the variable that divides the distribution 75% to the left and 25% to the right.
- The inter-quartile range is the value Q3 - Q1
- Calculating Q1, Q2 and Q3 for the variable 'Household Income'
- Conventionally, the mean and standard deviation are used as one pair of measures of location and spread, and the median and inter-quartile range as another pair.
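The quartiles and the inter-quartile range can be sketched with Python's standard library. The incomes are made-up values, and the choice of `method="inclusive"` is an assumption: statistical packages use several slightly different quartile conventions, so other software may report slightly different Q1 and Q3 for the same data.

```python
import statistics

# Hypothetical household incomes (illustrative values only, not chapter data).
incomes = [12000, 18000, 22000, 25000, 30000, 34000, 41000, 48000, 90000]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = statistics.quantiles(incomes, n=4, method="inclusive")

# Inter-quartile range: the width of the middle 50% of the data.
iqr = q3 - q1

print(f"Q1 = {q1}, Q2 (median) = {q2}, Q3 = {q3}, IQR = {iqr}")
```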
- Measuring the Symmetry (Skewness) of a Distribution
- Pearson's coefficient of Skewness
- Pearson's coefficient of Skewness = 3 × (mean - median) / standard deviation
- Quartile Measure of Skewness
- Quartile Measure of Skewness = [(Q3 - Q2) - (Q2 - Q1)] / (Q3 - Q1)
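Both skewness measures can be sketched as follows. The quartile measure is written in its standard form, [(Q3 - Q2) - (Q2 - Q1)] / (Q3 - Q1). The incomes are made-up values, the population standard deviation is used in Pearson's coefficient, and the `inclusive` quartile method is an assumption; other conventions would change the numbers slightly.

```python
import statistics

# Hypothetical household incomes (illustrative values only, not chapter data).
incomes = [12000, 18000, 22000, 25000, 30000, 34000, 41000, 48000, 90000]

mean = statistics.mean(incomes)
median = statistics.median(incomes)
sd = statistics.pstdev(incomes)  # population standard deviation

# Pearson's coefficient of skewness: 3 * (mean - median) / standard deviation.
pearson_skew = 3 * (mean - median) / sd

# Quartile measure of skewness: compares the upper and lower halves of the IQR.
q1, q2, q3 = statistics.quantiles(incomes, n=4, method="inclusive")
quartile_skew = ((q3 - q2) - (q2 - q1)) / (q3 - q1)

print(f"Pearson skewness  = {pearson_skew:.3f}")
print(f"Quartile skewness = {quartile_skew:.3f}")
```

Both values are positive for this data, reflecting the long right-hand tail created by the one very large income.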
14.5 SUMMARY
- What is Data
- Variables
- Two types of variable
- an attribute variable
- a measured variable
- The concept of a Statistical Distribution
- As applied to an attribute variable
- As applied to a measured variable
- Descriptive Statistics for a measured variable
- Measures of Centre
- Mean
- Median
- Measures of Width
- Standard Deviation
- Inter-Quartile Range
- Measures of Symmetry (Skewness)
- Pearson's coefficient of Skewness
- Quartile Measure of Skewness
14.6 THE NATURE OF A SAMPLE
- POPULATION
- Perfect Information
- In practice it is often impossible to enumerate the whole population.
- Instead, a sample is drawn from the population and used to make judgements (inferences) about the population.
- SAMPLE
- Imperfect Information
- Random sample
- Each item in the population has an equal chance of being included in the sample.
- The KEY PROBLEM is to use the sample data to draw valid conclusions about the population, knowing about and taking into account the 'error due to sampling'.
- Unrepresentative sample
- 'How to Lie with Statistics' (Darrell Huff)
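A small simulation can make the idea of 'error due to sampling' concrete. The sketch below builds a made-up population of ages, draws a simple random sample in which every member has the same chance of selection, and compares the sample mean with the population mean. The population, the sample size of 200 and the seed are all assumptions chosen for illustration.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 ages between 18 and 90 (made-up data).
population = [rng.randint(18, 90) for _ in range(10_000)]

# Simple random sample without replacement: each member is equally likely to be chosen.
sample = rng.sample(population, 200)

population_mean = statistics.mean(population)  # 'perfect information'
sample_mean = statistics.mean(sample)          # 'imperfect information'

# The difference between the two is the error due to sampling for this sample.
sampling_error = sample_mean - population_mean

print(f"population mean = {population_mean:.2f}")
print(f"sample mean     = {sample_mean:.2f}")
print(f"sampling error  = {sampling_error:+.2f}")
```

Changing the seed gives a different sample and a different error, which is the variability that later inference has to take into account.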
- A Credit Scenario
- Population: the set of all customers who used the credit facilities between 1st January 2000 and 31st December 2001.
- Sample size: 654 customers
- Data file: BDMCREDIT.MTW
14.8 DESCRIBING SAMPLE DATA
- Attribute variable: the number of occurrences of each attribute is obtained.
- Measured variable: sample descriptive statistics describing the centre, width and symmetry of the distribution are calculated.
- Attribute Data
- C5 Does the customer own their own house? Coded 0 = Yes, 1 = No
- C6 The Region in which the customer is resident? Coded:
- 1 = South West
- 2 = South East
- 3 = London
- 4 = Midlands
- 5 = North
- Command: STAT-TABLE-TALLY
- Summary Statistics for Discrete Variables
- Counts (OWN-OCC)
- Percent (OWN-OCC)
- Distribution graph (OWN-OCC)
Do you own your own house?
- Summary Statistics for Discrete Variables
- Count (REGION)
- The same information in words:
- 74 or 11.31% of the respondents are from the South West
- 132 or 20.18% of the respondents are from the South East
- 165 or 25.23% of the respondents are from the London area
- 161 or 24.62% of the respondents are from the Midlands
- 122 or 18.65% of the respondents are from the North
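The tally of an attribute variable is just a count of each code, converted to a percentage of the sample. The following Python sketch is an analogue of the tally, not the package command itself. It rebuilds a column of region codes from the counts quoted above, since the data file is not reproduced here; the variable names are assumptions.

```python
from collections import Counter

# Region coding as given for variable C6.
region_names = {1: "South West", 2: "South East", 3: "London", 4: "Midlands", 5: "North"}

# Rebuild a column of codes from the quoted counts (the raw data file is not available here).
region_codes = [1] * 74 + [2] * 132 + [3] * 165 + [4] * 161 + [5] * 122

# Tally: number of occurrences of each code.
counts = Counter(region_codes)
total = sum(counts.values())

# Percentage of the sample falling in each region, rounded to two decimals.
percents = {code: round(100 * count / total, 2) for code, count in counts.items()}

for code in sorted(counts):
    print(f"{region_names[code]:<10} {counts[code]:>4} {percents[code]:>6.2f}%")
```

The counts sum to the sample size of 654, and the computed percentages agree with the figures quoted in the slides.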
- Measured Variables
- For the 'Credit Data':
- C2 Customer's Age (AGE)
- C3 Household Income (per annum) (SALARY)
- C4 Estimated monthly outgoings on mortgage/rent/rates/utilities/credit card payments etc. (PAYOUT)
- C7 The Amount borrowed on credit (CREDIT)
- HISTOGRAM
- BOXPLOT
- The boxplot will prove to be a more useful way of picturing a sample distribution in later chapters, where data analysis is used to examine the connection between two sample variables.
14.7 DATA ANALYSIS USING SAMPLE DATA
- Before attempting to analyse any data, the analyst should ensure that:
- The problem under investigation is clearly understood and the objectives of the investigation have been clearly specified. Keep asking questions until satisfactory answers have been obtained.
- The individual variables making up the dataset are clearly understood.
- Descriptive Statistics
- Measures of Centre
- Mean
- Sample Mean $\bar{X}$
- Median
- Measures of Width
- Standard Deviation
- Sample Standard Deviation $S$
- Sample Variance $S^2$
- Inter-Quartile Range IQR
- Symmetry
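The slides do not spell out the formula for the sample statistics. A minimal sketch, assuming the conventional definitions in which the sample variance divides the sum of squared deviations by n - 1 rather than n, is shown below; the data values are illustrative only.

```python
import statistics

# Illustrative sample values (not chapter data).
sample = [2, 4, 4, 4, 5, 5, 7, 9]

# Sample mean: sum of the values divided by n.
x_bar = statistics.mean(sample)

# Sample variance S^2: sum of squared deviations divided by n - 1 (assumed convention).
s2 = statistics.variance(sample)

# Sample standard deviation S: square root of the sample variance.
s = statistics.stdev(sample)

print(f"sample mean = {x_bar}, S^2 = {s2:.4f}, S = {s:.4f}")
```

Because of the n - 1 divisor, S is slightly larger than the population-style value that `statistics.pstdev` would give for the same numbers.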
- Symmetry (Skewness)
- A distribution is skewed if one tail extends farther than the other.
- A value close to 0 indicates symmetric data.
- Negative values indicate negative/left skew.
- Positive values indicate positive/right skew.
- Example of a negative or left-skewed distribution (skewness = -1.44096)
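The sign convention above can be checked with a small sketch. The slides do not say which skewness formula produced the quoted value, so the function below uses one common choice, the adjusted moment coefficient n / ((n - 1)(n - 2)) × Σ((x - mean) / S)³, purely as an assumption. The function name and both data sets are hypothetical.

```python
import statistics


def sample_skewness(values):
    """Adjusted moment coefficient of skewness (one common convention, assumed here)."""
    n = len(values)
    if n < 3:
        raise ValueError("at least three values are needed")
    x_bar = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (n - 1 divisor)
    return n / ((n - 1) * (n - 2)) * sum(((x - x_bar) / s) ** 3 for x in values)


# Made-up data: a long left-hand tail, and a perfectly symmetric set.
left_skewed = [1, 6, 7, 8, 8, 9, 9, 9, 10, 10]
symmetric = [1, 2, 3, 4, 5]

print(f"left-skewed data: {sample_skewness(left_skewed):.3f}")
print(f"symmetric data:   {sample_skewness(symmetric):.3f}")
```

The first data set gives a negative value and the second a value of zero (up to rounding), matching the interpretation given in the bullets.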
- The Relationship between the descriptive statistics and the Boxplot
- The asterisks to the right of the median indicate sample values that are, in some sense, extreme.
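The slides do not state the rule used to decide which values are 'extreme'. A widely used boxplot convention, assumed here, flags values lying more than 1.5 × IQR below Q1 or above Q3. The sketch applies that rule to made-up incomes; the quartile method and the 1.5 multiplier are both assumptions.

```python
import statistics

# Hypothetical household incomes (illustrative values only, not chapter data).
incomes = [12000, 18000, 22000, 25000, 30000, 34000, 41000, 48000, 90000]

q1, q2, q3 = statistics.quantiles(incomes, n=4, method="inclusive")
iqr = q3 - q1

# Fences: values outside these limits are flagged as extreme (assumed 1.5 * IQR rule).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

extreme_values = [x for x in incomes if x < lower_fence or x > upper_fence]

print(f"fences: {lower_fence} to {upper_fence}")
print(f"extreme values: {extreme_values}")
```

Here only the largest income falls outside the fences, so it is the one value that would be drawn as a separate point on the high side of the box.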
14.9 INVESTIGATING RELATIONSHIPS BETWEEN VARIABLES
- To investigate the relationship between variables, two kinds of variable are distinguished.
- Response variable
- A variable that measures, either directly or indirectly, the objectives of the analysis
- Explanatory variable
- A variable that may influence the response variable
- Example 1
- A university wishes to investigate the salary of its graduates five years after graduating.
- The questionnaire records:
- 'Current Salary'
- 'Starting Salary'
- 'Class of Degree' Coded 1 = First, 2 = Upper Second, 3 = Lower Second, 4 = Third, 5 = Pass.
- 'Graduate's Gender' Coded 1 = Male, 2 = Female.
- Response variable
- Current Salary (measured variable)
- Explanatory Variables
- Starting Salary (measured variable)
- Class of Degree (attribute variable)
- Graduate's Gender (attribute variable)
- Example 2 CREDIT scenario
- Objectives of the analysis
- To investigate the nature of credit transactions
- Response variable: 'The Amount borrowed on credit'
- The problem is to investigate the relationship between 'The Amount borrowed on credit' and the other variables.
- Summary
- Combinations of Response Variable and Explanatory Variable
EXPLANATORY VARIABLE
- The method for investigating the connection between a response variable and an explanatory variable depends on the types of the variables.
- Investigating the connection between a measured response variable and a measured explanatory variable
- Investigating the connection between a measured response variable and an attribute explanatory variable
Homework
- Find or collect some data from your life or business practice, then complete the following tasks:
- Draw the statistical distribution of the data
- Calculate the Mean and Standard Deviation
- Calculate the Median and Inter-Quartile Range
- Calculate Pearson's Coefficient of Skewness and the Quartile Measure of Skewness