Introduction to Data Analysis. - PowerPoint PPT Presentation

Introduction to Data Analysis. Levels of measurement and descriptive statistics.

Slides: 49
Provided by: jame163
Transcript and Presenter's Notes


1
Introduction to Data Analysis.
  • Levels of measurement and
  • Descriptive statistics

2
What's this course about?
  • Introduction to the use of quantitative data in
    social science.
  • The tools we need in order to use numerical data
    (i.e. anything we can count) to better understand
    the world.
  • A very basic introduction: students intending to
    write theses using primarily quantitative data
    should also attend the intermediate/advanced
    lectures.

3
Why am I here?
  • Your own research.
      • Using quantitative data as an integral part of
        your thesis.
      • Using quantitative data as supplementary
        evidence.
      • Making better use of qualitative data.
  • Other people's research.
      • Understanding work in your area.
      • Criticising work in your area.
  • It's compulsory.

4
Today's lecture
  • Different types of data.
  • Descriptive statistics.
  • Averages of a distribution of data.
  • How to measure the spread of a distribution.
  • How (not) to lie with statistics.
  • Graphs and best practice.
  • Reading: A & F, Chapters 1 & 3.

5
What is Statistics?
  • Methods for:
      • Designing and carrying out research studies.
      • Describing collected data.
      • Making decisions/inferences about phenomena
        represented by data.

6
Some key terms (1)
  • Population: the total set of individual objects
    or persons of interest in a study.
  • Sample: a subset of the population that is
    actually observed.

7
Key Terms (2)
  • Descriptive statistics consist of graphical and
    numerical techniques for summarizing the
    information in a collection of data.
  • Inferential statistics consist of procedures for
    making generalizations about characteristics of a
    population, based on information from a sample.

8
Key terms (3)
  • Parameters are the characteristics of the
    population about which we make inferences using
    sample data
  • Statistics are the corresponding characteristics
    of the sample data, upon which we base our
    inferences about parameters.

9
Variables and their measurement
  • Variable: a measurement of a characteristic of a
    subject (something or someone) that varies across
    subjects in a population of subjects.
  • Different levels of measurement, which means that
    we have to examine different types of data in
    different ways.

10
Nominal level measures (1)
  • Just represent a category.
  • e.g. Male / Female.
  • e.g. Single / Married / Divorced.
  • Since there is no ordering, these are nominal
    measures.
  • Often called qualitative, since two values differ
    in quality not quantity.

11
Nominal level measures (2)
  • Can quantify these data by tabulating them.
  • Normally represent nominal data in a simple table
    with percentages.
  • Take the marital status of all of my 25 friends
    (i.e. the population we are looking at is all
    Ryan's friends).

Marital status   Number     %
Single               18    72
Married               6    24
Divorced              1     4
Total                25   100

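
Tabulating nominal data is just counting categories and converting the counts to percentages. A minimal sketch of that, using the counts from the table above (the function name `frequency_table` is made up for illustration):

```python
from collections import Counter

# Marital status of the 25 friends, using the counts from the slide.
statuses = ["Single"] * 18 + ["Married"] * 6 + ["Divorced"] * 1


def frequency_table(values):
    """Return {category: (count, percentage)} for a list of nominal values."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: (n, 100 * n / total) for category, n in counts.items()}


table = frequency_table(statuses)
for category, (count, percent) in table.items():
    print(f"{category:<10}{count:>4}{percent:>6.0f}%")
```

Note that nothing here relies on the categories having an order, which is exactly what makes the measure nominal.
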
12
Ordinal level measures
  • Categories again, but these categories are
    ordered.
  • e.g. Many polling/survey questions.
  • "It was right for Britain to send troops to Iraq."
      • Strongly agree
      • Agree
      • Disagree
      • Strongly disagree
  • The distance between each category is unknown.
  • Strong agreers are more hawkish than agreers,
    but we have no idea how much more hawkish they
    are.
  • We can say one observation is greater in rank than
    another.
  • Can be ranking in class (for example) or from
    naturally ordered categories
  • Called quantitative because different values
    represent different magnitudes.

13
Interval level measures
  • Numbers represent a quantitative variable.
  • e.g. Income, number of pupils per teacher, age,
    etc.
  • There is a specific distance between each level.
  • We can not only say that my sister is younger
    than I am, but that she is 2 years younger.
  • Age is a continuous variable: one can also
    subdivide the measure (784 days, 3 hours and 2
    minutes younger).
  • It is also true that my parents have only 2
    children.
  • Number of children is a discrete variable: you
    cannot subdivide children. You have 1, or 2, or
    3. You can't have 2½ children.

14
Descriptive statistics
  • Most statistics that we will cover today apply to
    variables that are interval level measures.
  • Descriptive statistics are just that. They
    describe a large amount of data in a summary
    form.
  • Why bother? Because we're often interested in
    what a typical person (or country or school or
    parliament etc.) looks like.

15
Measuring the central tendency
  • What we want to do is reduce a lot of interval
    level measurements to a few numbers.
  • The salaries of all of my best friends (the
    population is Ryan's best friends).
  • What is the typical annual salary of a best
    friend of mine?

Name     Salary (£)
Ellen        75,000
Jenny        13,000
Justin       31,000
Andrew       26,000
Mungo        15,000

16
The mean
  • The most usual way of measuring the central
    tendency is to use the mean (or average).
  • This is simply the sum of the measurements
    divided by the number of observations.
  • For our salaried people:
  • Mean = (75,000 + 13,000 + 31,000 + 26,000 +
    15,000) / 5 = £32,000.
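
For anyone who would rather check the arithmetic than trust it, here is a small sketch of the mean for the five salaries (the helper name `mean` is just for illustration; Python's standard library has an equivalent in `statistics.mean`):

```python
# Annual salaries of the five best friends, from the table on the previous slide.
salaries = {
    "Ellen": 75_000,
    "Jenny": 13_000,
    "Justin": 31_000,
    "Andrew": 26_000,
    "Mungo": 15_000,
}


def mean(values):
    """Sum of the measurements divided by the number of observations."""
    values = list(values)
    return sum(values) / len(values)


mean_salary = mean(salaries.values())
print(mean_salary)  # 32000.0
```
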

17
A (very) little bit of math
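  • To introduce some terms which will be useful
    later, the mean is calculated as follows. Suppose
    we have n observations, with each value denoted
    by X_1, X_2 and so on until X_n. Then the mean
    (written \bar{X}) is described as follows:

\[ \bar{X} = \frac{X_1 + X_2 + \dots + X_n}{n} \]

Or, to put it another way:

\[ \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i \]
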
18
The mean's properties
  • Shift of origin of measurement.
      • If everyone earns £2,000 more, then the new mean
        salary is just the old mean salary (£32,000) plus
        £2,000.
  • Change of scale.
      • If we calculate salary in dollars (say £1 = $2),
        then the new mean salary is simply twice the old
        mean salary.
  • Sum of two variables.
      • Imagine that income = salary + savings interest.
      • Mean income = mean salary + mean savings
        interest.
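
The three properties can be checked numerically. This is only a sketch: the salaries come from the earlier slide, but the savings-interest figures are invented purely for illustration.

```python
from statistics import mean

salary = [75_000, 13_000, 31_000, 26_000, 15_000]
# Hypothetical savings interest for the same five people (made-up numbers).
interest = [1_200, 0, 300, 450, 50]

# Shift of origin: everyone earns 2,000 more.
shifted = [x + 2_000 for x in salary]
# Change of scale: convert to dollars at the slide's rate of 2 dollars per pound.
in_dollars = [2 * x for x in salary]
# Sum of two variables: income = salary + savings interest.
income = [s + i for s, i in zip(salary, interest)]

print(mean(shifted), mean(salary) + 2_000)           # same value
print(mean(in_dollars), 2 * mean(salary))            # same value
print(mean(income), mean(salary) + mean(interest))   # same value
```
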

19
The median
  • Another common way of deriving one number to
    describe many is to use the median.
  • Imagine we ranked all observations: the median is
    simply the observation in the middle (½ of
    observations above and ½ below).
  • In ascending order the salaries are:
  • £13,000, £15,000, £26,000, £31,000, £75,000.
  • Median = £26,000.
  • With an even number of observations, the median is
    the mean of the middle two. If those were £26,000
    and £31,000: median = ½(26,000 + 31,000) = £28,500.
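
A rough sketch of the ranking procedure, covering both the odd and the even case. The four-salary list in the second call is an assumption made for illustration (the five salaries with the lowest one dropped), chosen so that the middle two values are 26,000 and 31,000.

```python
def median(values):
    """Middle observation after ranking; mean of the middle two if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


salaries = [75_000, 13_000, 31_000, 26_000, 15_000]
odd_case = median(salaries)
print(odd_case)  # 26000

# Hypothetical even-sized example: the same salaries without the lowest one.
even_case = median([15_000, 26_000, 31_000, 75_000])
print(even_case)  # 28500.0
```
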

20
The median's properties (1)
  • Shift of origin of measurement. YES
  • Change of scale. YES
  • Sum of two variables. NO
  • The lack of this property is somewhat important
    (which will become apparent in the following
    weeks), and is related to one of the reasons why
    we generally use the mean in most statistical
    analysis.
  • Nonetheless, the median does have some advantages
    over the mean in describing some types of data.

21
The median's properties (2)
  • For our salary example, the mean of my best
    friends' salaries gives a substantially higher
    value than the median (£6,000 more).
  • This is due to the distribution of the
    observations. The mean and median coincide when
    the distribution of observations is symmetrical,
    and drift apart when it is skewed.
  • Imagine we now look at all my friends and
    acquaintances (the population of 25 people as
    before), and plot the frequency of each salary
    for all 25.

22
Frequency graph of salaries
[Figure: frequency of each salary for all 25 people; median = £26,000, mean = £34,000.]

23
Positions of the median and mean
  • For distributions with a long tail to the right,
    the mean will take a higher value than the
    median.
  • This is generally true across the world for
    income distributions, and is captured by Pen's
    parade of dwarfs and a few giants.
  • If such a parade were organised today, then the
    person of mean height (and income) would be
    taller (and richer) than 65% of the population
    and so would pass by after 40 minutes of the
    hour-long parade had elapsed.
  • Mean income is £24,000, median income is
    £16,000.
  • For data with outliers the median can give a
    better idea of what the typical observation is
    like.

24
Ordinal level data
  • The median can be used for ordinal level data.
  • Imagine we had asked my 5 best friends about
    their position on the Iraq war: 2 strongly agreed
    with sending British troops, one agreed, one
    disagreed and one strongly disagreed.
  • We can rank these answers and then find the
    median.
  • Strongly agree, strongly agree, agree, disagree,
    strongly disagree.
  • Thus the median answer is "agree".
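
The same idea as a sketch in code: map each answer to its rank on the scale, sort the ranks, and read off the middle one. The function name and the decision to reject even-sized inputs are choices made here for illustration, since with an even count the median can fall between two categories.

```python
# Response categories in their natural order (list index = rank).
SCALE = ["Strongly agree", "Agree", "Disagree", "Strongly disagree"]


def ordinal_median(responses, scale=SCALE):
    """Median category for an odd number of ordinal responses."""
    if len(responses) % 2 == 0:
        raise ValueError("even count: the median may fall between two categories")
    ranks = sorted(scale.index(r) for r in responses)
    return scale[ranks[len(ranks) // 2]]


# The five friends' answers, in no particular order.
answers = ["Agree", "Strongly disagree", "Strongly agree", "Disagree", "Strongly agree"]
median_answer = ordinal_median(answers)
print(median_answer)  # Agree
```

Only the ordering of the categories is used, never the distance between them, which is why this works for ordinal data while the mean does not.
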

25
Nominal level data
  • In general, we can't use the median or mean for
    nominal data.
  • Normally use the mode. This is the most commonly
    occurring value.
  • e.g. if 53 people here are politics students, 40
    sociology students, and 46 are other subjects,
    then the modal value is politics.
  • There is one special case in which we can use the
    mean for nominal data, however.

26
Nominal binary data
  • Binary data is an exception, as we can use the
    mean. Binary data (e.g. Yes/No, Male/Female) can
    be coded as 0 or 1.
  • Take a variable measuring sex, where men are coded
    1 and women are coded 0.
  • The mean score for those 0s and 1s is the
    proportion of men. There were 2 women and 3 men
    amongst my best friends, so the mean is 3/5 = 0.6.
  • The median does NOT make sense for binary data.
    It just tells us what the majority of the
    population is.
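
A tiny sketch of the 0/1 coding, using the 3 men and 2 women from the slide (the variable names are made up for illustration):

```python
# Sex of the five best friends, coded 1 for men and 0 for women.
is_male = [1, 1, 1, 0, 0]

# The mean of a 0/1 variable is the proportion of 1s.
proportion_men = sum(is_male) / len(is_male)
print(proportion_men)  # 0.6
```
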

27
Exercise
  • Population is all countries with nuclear
    capability, and variable is approximate number of
    nuclear weapons.
  • What's the mean, mode and median for no. of
    nukes?
  • How good is each of these at summarizing the
    data? Do we need more information than just a
    measure of central tendency?

Country     No. of nukes
USA               10,000
India                 75
China                400
France               400
Britain              200
Russia            12,800
Israel               100
Pakistan              25

28
Some answers
  • Mean = 24,000/8 = 3,000
  • Median = (400 + 200)/2 = 300
  • Mode = 400
  • These summary measures are useful, but we also
    need to know something about the distribution,
    because two countries account for virtually all
    the nuclear weapons in the world.
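
The answers can be reproduced with Python's standard `statistics` module. This is a sketch using the approximate counts from the exercise table; the last line is an extra check, not on the slide, of how much of the total the two largest arsenals account for.

```python
from statistics import mean, median, mode

# Approximate number of nuclear weapons per country, from the exercise table.
nukes = {
    "USA": 10_000, "India": 75, "China": 400, "France": 400,
    "Britain": 200, "Russia": 12_800, "Israel": 100, "Pakistan": 25,
}
counts = list(nukes.values())

nuke_mean = mean(counts)      # 24,000 / 8
nuke_median = median(counts)  # mean of the two middle values, 200 and 400
nuke_mode = mode(counts)      # 400 occurs twice

# Share of all weapons held by the two largest arsenals.
top_two_share = sum(sorted(counts)[-2:]) / sum(counts)

print(nuke_mean, nuke_median, nuke_mode, top_two_share)
```
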

29
Measures of dispersion
  • The mean (or median) tells us something about the
    centre of the distribution, but what about its
    dispersion?
  • The means/medians of the distributions of
    children's scores on a maths test in three
    different classes are all the same (48
    observations, mean of 7, median of 7), but each
    tells a quite different story.

30
The range
  • The range is simply a measure of the distance
    between the largest and smallest observations.
  • The range for our salary example is therefore:
  • £75,000 - £13,000 = £62,000.
  • Clearly this is not ideal as it relies on only
    two observations.
  • Say we have 1000 poker players. 999 win nothing,
    and 1 wins £1 million. The range indicates lots of
    variation, when most people are actually
    identical.
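
A short sketch of both range examples. The helper name `value_range` is made up for illustration, and the winnings are left unit-free.

```python
def value_range(values):
    """Distance between the largest and smallest observations."""
    return max(values) - min(values)


salaries = [75_000, 13_000, 31_000, 26_000, 15_000]
salary_range = value_range(salaries)
print(salary_range)  # 62000

# 1000 poker players: 999 win nothing, one wins a million.
winnings = [0] * 999 + [1_000_000]
poker_range = value_range(winnings)
share_winning_nothing = winnings.count(0) / len(winnings)
print(poker_range, share_winning_nothing)  # 1000000 0.999
```

The range is a million even though 99.9% of the players have identical winnings, which is the weakness the slide points out.
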

31
The variance and standard deviation
  • A better way of assessing how much values of a
    variable vary around the mean is to use the
    standard deviation or variance.
  • Basic idea is to measure how different individual
    values are from the mean value.
  • Some of these deviations from the mean will be
    positive and some negative, so we square each
    deviation.

32
The variance
  • Take my 5 best friends. The mean salary was
    32,000.
  • If we added up all the differences then we would
    get zero, so we need to square the differences
    (i.e. multiply them by themselves).

[Figure: each friend's salary plotted against the mean of £32,000. Ellen: 75,000 - 32,000 = 43,000; Mungo: 15,000 - 32,000 = -17,000.]

33
Calculating variance
  • Salary example, with 5 obs, and mean of 32,000.

Salary (£000s)   Deviation from mean   Squared deviation
75               75 - 32 = 43          43² = 1849
13               13 - 32 = -19         (-19)² = 361
31               31 - 32 = -1          (-1)² = 1
26               26 - 32 = -6          (-6)² = 36
15               15 - 32 = -17         (-17)² = 289
                                       Sum = 2536

34
Calculating standard deviation
  • The standard deviation is the most common way to
    measure deviation from the mean and is simply the
    square root of the variance.
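  • We normally call the variance s² and the standard
    deviation s. Thus for our example, s² = 2536/5 =
    507.2, and s = √507.2 ≈ 22.5 (in £000s).

Note: dividing by n, as here, treats the 5 friends as the whole population. With a sample we usually use n - 1 in the denominator.
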
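
The whole calculation can be followed step by step in code. This is a sketch using the five salaries in £000s; it computes both the divide-by-n version used in the example and the n - 1 version mentioned in the note, and cross-checks them against Python's `statistics` module.

```python
from math import sqrt
from statistics import pvariance, variance

salaries_k = [75, 13, 31, 26, 15]  # salaries in thousands

n = len(salaries_k)
mean_k = sum(salaries_k) / n                         # 32.0
squared_devs = [(x - mean_k) ** 2 for x in salaries_k]
sum_sq = sum(squared_devs)                           # 1849 + 361 + 1 + 36 + 289

pop_variance = sum_sq / n            # divide by n: the slide's 507.2
pop_sd = sqrt(pop_variance)          # about 22.5
sample_variance = sum_sq / (n - 1)   # divide by n - 1
sample_sd = sqrt(sample_variance)

print(pop_variance, round(pop_sd, 1))
print(sample_variance, round(sample_sd, 1))
# The standard library gives the same two variances.
print(pvariance(salaries_k), variance(salaries_k))
```

With only five observations the choice of denominator matters quite a lot; with large n the two versions are nearly identical.
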
35
Examples of standard deviation
s = 1.02: Tight distribution (all children perform similarly)
s = 1.67: Clustered distribution (most children perform to a similar level, with some variation)
s = 4.01: Dispersed distribution (one group of geniuses, one group of idiots)

36
But what does it mean?
  • Our salary example had a standard deviation of
    22.5, but for the distributions above s varied
    between 1 and 4. What does this tell us?
  • Best way to think of it is as a kind of rough
    average distance of an observation to the mean.
  • Thus the standard deviation depends on the units
    we are measuring in.
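
To see the dependence on units, here is a sketch that computes the standard deviation of the same five salaries twice, once in pounds and once in thousands of pounds. It uses `statistics.pstdev`, the divide-by-n version, to match the earlier example.

```python
from statistics import pstdev

salaries_pounds = [75_000, 13_000, 31_000, 26_000, 15_000]
salaries_thousands = [x / 1000 for x in salaries_pounds]

sd_pounds = pstdev(salaries_pounds)        # roughly 22,500
sd_thousands = pstdev(salaries_thousands)  # roughly 22.5

print(round(sd_pounds), round(sd_thousands, 1))
```

The spread of the data has not changed at all; only the unit has, so a standard deviation should always be read alongside the unit of measurement.
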

37
Standard deviation summary
  • Broadly speaking, high levels of s indicate
    greater variation, and the value of s gives a
    broad idea of a typical distance from the mean.
  • The concept of standard deviation is an important
    one, and next week I'll talk more about
    particular types of distributions and their
    properties.

38
How to (not) lie with statistics
  • Even simple descriptive statistics can be misused
    in order to mislead.
  • Particularly the case for simple graphs.
  • Most examples I will use here are from Edward
    Tufte, The Visual Display of Quantitative
    Information (1983, and later reprints).
  • See any copy of any of the Sunday papers for
    similar glaring errors, however.

39
Too little information
  • Presenting too little summary information.
  • Example courtesy of Tukey (1979) in JASA.
  • Take Washoe County in Nevada, USA. There is a
    mean population density of 13½ people per square
    mile.
  • The mean is not informative without information
    on the distribution, however, for in fact 80% of
    the inhabitants live in two cities.
  • The cities have population densities of 5000 per
    square mile.
  • The rest of the county has a population density
    of 2½ people per square mile.

40
Base years (1)
  • Picking your base year (Tufte 1983).

41
Base years (2)
42
Measures over time (1)
43
Measures over time (2)
44
The lie factor
  • Are doctors really becoming smaller?

45
Small differences
  • Just because something's top or bottom of a list
    doesn't imply anything.
  • The difference between top and bottom might be
    very small.
  • Close to home, look at the Norrington table for
    this. The difference between the middle 10
    colleges is essentially zero, but it's the
    ranking that everyone cares about.
  • Ranking of countries by something like literacy
    rates is often similarly futile. There has to be
    one at the top with 99.9%, but all Western
    countries will have rates of around 99%.

46
(very) Small samples
  • "9 out of 10 cats prefer Whiskers."
  • We may think that the evidence for this is strong
    if thousands of cats had their opinion solicited,
    but maybe weak if only 10 cats were questioned
    out of the population of millions.
  • Knowing when a small sample is too small is one
    of the topics we will cover over the next two
    weeks and is a critical part of understanding
    commonly used statistics.

47
How to talk back to a statistic
  • Who says so?
      • We all want to prove our own theories correct.
  • How does he know?
      • Is the data reputable?
  • What's missing?
      • e.g. means are no use without standard
        deviations.
  • Does it make sense?
      • Social science is the science of the bloody
        obvious most of the time. Don't let numbers
        confuse or fool you: if it sounds wrong, it
        probably is.

48
Next week
  • Go back to ideas of distributions of data, and
    commonly found distributions.
  • Also look at sampling and surveys.
  • We've been discussing data where we have
    information on the entire population (all my
    friends, all doctors in the US, etc.).
  • We might more normally have only a sample of
    observations though.
  • How accurate are samples in describing
    populations?