Title: CIS205 Forensic Statistics
1CIS205 Forensic Statistics
- Module Leader
- Michael.Oakes_at_sunderland.ac.uk
2Data Types, Location and Dispersion
- Chapter 2 of Introduction to Statistics for
Forensic Scientists by David Lucy (Wiley, 2005)
3Types of Data
- Nominal: simply classified into different
categories, the ordering having no significance,
e.g. people classified by sex (male/female),
drugs classified by location (South America /
Afghanistan / Indian / Oriental).
- Ordinal: data again classified into discrete
categories, but this time the ordering does
matter, e.g. the development of the third molar
classified into ten categories related to age
(Solari and Abramovitch, 2002).
- Continuous: data can take any value, e.g. the
concentration of magnesium in glass can be any
value between 0 and 5, such as 1.225.
4Types of Data (2)
- Nominal and ordinal data types are known
collectively as discrete, because they place
entities into discrete, mutually exclusive categories.
- All three data types are called variables.
- Nominal and ordinal variables that are used to
classify other variables are called factors.
E.g. Δ9-THC concentrations in marijuana seizures
from various years in the 1980s in Table 2.1.
Here Δ9-THC is a continuous variable, and year
is an ordinal variable used as a factor to
classify Δ9-THC.
5Table 2.1. Year and Δ9-THC for marijuana seizures
(ElSohly et al., 2002)
6Table 2.2 Data of Table 2.1 classified by year
as a factor.
7Marijuana
- Marijuana
- Derived from the plant Cannabis
- Hashish: concentrated
- Sinsemilla: unfertilized flowering tops of the
female Cannabis plant
- Active ingredient is THC
- Potency is normally 4-5%
- Sinsemilla averages 6-12%
- Liquid hashish averages 8-22%
- Potential medical uses
9Identification of Marijuana
- Green Plant Material
- Dry Package in Paper
- Microscopic Examination
- Look for "bear claw" cystolithic hairs on the top
surface of the leaf
- Duquenois-Levine Color test (Screening)
- 2% vanillin, 1% acetaldehyde in ethanol
- Hydrochloric acid gives a purple color
- Chloroform is heavier than water and forms the lower
layer; it will pull the purple color into the lower layer
- Thin Layer Chromatography (TLC)
- Results: THC gives a red color on the plate
- Marijuana is a mixture of compounds
10Powders / Color Tests
- Marquis Test: 2% formaldehyde in H2SO4
- Purple: opiates
- Orange to brown: amphetamine, methamphetamine
- Blue: Ecstasy
- Red: aspirin
- Pink: cocaine
11Populations and Samples
- Generally, in chemistry and biology, a sample is
something taken for the purposes of examination.
A fibre or a piece of glass found at the
scene of a crime would be termed a sample.
- In statistics, sample has a different meaning: it
is a subset of a larger set, known as a
population.
- In Table 2.1, the Δ9-THC column gives
measurements of the Δ9-THC in a sample of
marijuana seizures at the corresponding date. In
this case the population is marijuana seizures.
12Distributions
- A distribution is an arrangement of frequencies
of some observation in a meaningful order.
- If all 20 values for the THC content of 1986
marijuana seizures on the next slide are grouped
into broad categories, i.e. the continuous
variable THC is made into an ordinal variable
with many values, then the frequencies of THC
content in each category can be tabulated.
- This table can be represented graphically as a
histogram.
13 Δ9-THC concentrations in a sample of 20 marijuana
seizures taken in 1986, arranged in ascending
order (each row covers a 0.5-wide group; "none"
marks a group with no values)
- 6.29
- (none)
- 7.05 7.21
- 7.72 7.91
- 8.16 8.29 8.32 8.40 8.41 8.41
- 8.82 8.84 8.93
- 9.02 9.26
- 9.74 9.95
- 10.30
- 10.70
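The grouping step described on the Distributions slide can be sketched in Python. This is only an illustration: the group width of 0.5 and the helper name `frequency_table` are assumptions made for the example, not something the slides specify.

```python
import math
from collections import Counter

# The 20 Δ9-THC values from the 1986 sample listed above.
thc_1986 = [6.29, 7.05, 7.21, 7.72, 7.91, 8.16, 8.29, 8.32, 8.40, 8.41,
            8.41, 8.82, 8.84, 8.93, 9.02, 9.26, 9.74, 9.95, 10.30, 10.70]


def frequency_table(values, width):
    """Return (lower, upper, count) for every group of the given width.

    Each group is half-open: lower <= value < upper.
    Empty groups between the smallest and largest value are kept.
    """
    indices = [math.floor(v / width) for v in values]
    counts = Counter(indices)
    return [(k * width, (k + 1) * width, counts[k])
            for k in range(min(indices), max(indices) + 1)]


# 0.5 is an assumed group width, chosen for illustration.
table = frequency_table(thc_1986, 0.5)

# Print a crude text histogram: one '#' per observation in the group.
for lower, upper, count in table:
    print(f"{lower:5.1f} to {upper:5.1f} | {'#' * count} ({count})")
```

With this width the tallest row is the group from 8.0 to 8.5, whose mid-point is 8.25.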
15The histogram
- The histogram, which gives the sample frequency
distribution for Δ9-THC in marijuana from 1986,
has 3 important properties:
- It has a single highest point at about 8.25
Δ9-THC, the two ends of the distribution having
progressively lower frequencies as they get
further from the highest point. The curve is
unimodal, and shows that Δ9-THC tends towards a
value of about 8.25.
- The distribution is more or less symmetric about
the 8.25 value, i.e. not skewed.
- The distribution is dispersed about the 8.25
point in some measurable way.
16Location
- How do we measure the typical value (location) and
the dispersion?
- First some mathematical notation and terminology
is required.
17Arrays and Scalars
- Let x be an array such that $x = \{2, 4, 3, 5, 4\}$.
This means that x is a series of quantities,
called an array, which are indexed by the suffix
i, so that $x_1 = 2$, $x_2 = 4$, $x_3 = 3$, $x_4 = 5$, $x_5 = 4$.
- n is the number of elements in array x. In this
case there are five elements in x, so that $n = 5$.
- n is a single number on its own, and is sometimes
referred to as a scalar.
18Summation Σ
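As a small illustration of the array, scalar and summation notation, the example array x can be written in Python. The variable names are chosen for this sketch only. Note that Python indexes from 0, whereas the suffix i in the notation starts at 1.

```python
# The example array from the Arrays and Scalars slide.
x = [2, 4, 3, 5, 4]

# n is a scalar: the number of elements in x.
n = len(x)

# The suffix i runs from 1 to n in the notation, so element x_i
# is stored at Python position i - 1.
x_1 = x[0]
x_5 = x[n - 1]

# Summation: add up x_i for i = 1, ..., n.
total = 0
for i in range(1, n + 1):
    total += x[i - 1]

print(n, x_1, x_5, total)
```

The built-in `sum(x)` gives the same total as the explicit loop.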
19Multiplication
- Mathematicians often leave out multiplication
signs, so rather than writing out $3 \times a = 6$, they
write $3a = 6$.
- But $3 \times 4 = 12$ would never be written as $34 = 12$.
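20There are 3 basic measures of location: mean,
median and mode.
Mean is the arithmetic mean, what we usually
think of as the average, denoted by $\bar{x}$:
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
In the previous example, $\bar{x} = (2 + 4 + 3 + 5 + 4)/5 = 18/5 = 3.6$.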
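A minimal sketch of the arithmetic mean for the example array, with a hypothetical helper named `mean` written out by hand so that the sum and the division by n are visible:

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    return sum(values) / len(values)


x = [2, 4, 3, 5, 4]
x_bar = mean(x)
print(x_bar)
```

Python's standard library offers `statistics.mean` for the same calculation.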
21Median
- Median is simply the value of the middle one of a
number of values ordered in increasing magnitude.
- If $x = \{2, 4, 3, 5, 4\}$, ordering x by increasing
magnitude gives $\{2, 3, 4, 4, 5\}$. In the range 1 to 5
the central value is the third, so the median is
4.
- For even n, take the value halfway between the two
middle values (their mean).
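The two cases (odd and even n) can be sketched as follows. The function name `median` is chosen for this example. The even case is illustrated with the 20 Δ9-THC values from the 1986 sample.

```python
def median(values):
    """Middle value of the ordered data; for even n, the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: a single central value exists.
        return ordered[mid]
    # Even n: take the value halfway between the two central values.
    return (ordered[mid - 1] + ordered[mid]) / 2


x = [2, 4, 3, 5, 4]
thc_1986 = [6.29, 7.05, 7.21, 7.72, 7.91, 8.16, 8.29, 8.32, 8.40, 8.41,
            8.41, 8.82, 8.84, 8.93, 9.02, 9.26, 9.74, 9.95, 10.30, 10.70]

print(median(x))         # odd n = 5
print(median(thc_1986))  # even n = 20
```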
22Mode
- Mode is the value with the most instances. In
$x = \{2, 4, 3, 5, 4\}$ there are two occurrences of 4, so 4
is the modal value.
- Technically, for the THC concentration data all
values are on a continuous scale, so exact repeats
are not expected (the two 8.41 values in the 1986
sample coincide only after rounding). However, if the
data are grouped, as with the histogram, the modal
group for the sample from 1986 is the one with the
tallest column, corresponding to a value of 8.25
(mid-point of modal group).
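A short sketch of both ideas: the mode of the small array, and the modal group of the 1986 sample. The group width of 0.5 is an assumption made for this example; a different width could give a different modal group.

```python
import math
from collections import Counter

# Mode of a small discrete array: the value with the most occurrences.
x = [2, 4, 3, 5, 4]
modal_value, occurrences = Counter(x).most_common(1)[0]

# Modal group of continuous data: group first, then find the fullest group.
thc_1986 = [6.29, 7.05, 7.21, 7.72, 7.91, 8.16, 8.29, 8.32, 8.40, 8.41,
            8.41, 8.82, 8.84, 8.93, 9.02, 9.26, 9.74, 9.95, 10.30, 10.70]
width = 0.5  # assumed group width

# Group k holds the values with k * width <= value < (k + 1) * width.
group_counts = Counter(math.floor(v / width) for v in thc_1986)
modal_index, modal_count = group_counts.most_common(1)[0]
modal_midpoint = (modal_index + 0.5) * width

print(modal_value, occurrences)
print(modal_midpoint, modal_count)
```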
23Skewed distributions
- Using the correct measure of location is
important.
- Usually this will be the mean, but in the case of
incomes the median and mode give a truer picture.
- If $x = \{12000, 20000, 21000, 11000, 9000, 7000, 13000, 85000, 120000\}$,
then mean = 33111 and median = 13000.
- This is an example of a skewed distribution, in
this case highly skewed towards the higher values
of income (positively skewed).
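The income figures can be checked with Python's standard `statistics` module. This sketch only reproduces the arithmetic on the slide; the variable names are illustrative.

```python
import statistics

incomes = [12000, 20000, 21000, 11000, 9000, 7000, 13000, 85000, 120000]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

# The two largest incomes pull the mean well above the median,
# which is the signature of a positively skewed distribution.
print(round(mean_income), median_income)
print(sum(1 for v in incomes if v < mean_income), "of", len(incomes),
      "incomes lie below the mean")
```

Seven of the nine incomes lie below the mean, which is why the median describes a typical income better here.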
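24The standard measure of dispersion is called
variance:
$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
We divide by n-1 rather than n to offset
the sample size: the deviations are measured from
the sample mean, which is itself estimated from the
same data, so dividing by n would tend to
underestimate the dispersion. The correction
matters less as n grows.
There are other measures of dispersion, including
the inter-quartile range.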
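A minimal sketch of the sample variance with the n-1 divisor, applied to the example array x = 2, 4, 3, 5, 4. The helper name `sample_variance` is hypothetical. The inter-quartile range is computed with `statistics.quantiles`; quartile conventions differ between textbooks and software, so the value from the library's default method is one convention among several.

```python
import statistics


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(v - mean) ** 2 for v in values]
    return sum(squared_deviations) / (n - 1)


x = [2, 4, 3, 5, 4]

s2 = sample_variance(x)
print(s2)

# The standard library uses the same n - 1 divisor.
print(statistics.variance(x))

# Inter-quartile range: distance between the first and third quartiles
# (library default quartile method; other conventions exist).
q1, q2, q3 = statistics.quantiles(x, n=4)
iqr = q3 - q1
print(iqr)
```

For this array the squared deviations from the mean of 3.6 sum to 5.2, so the variance is 5.2 / 4 = 1.3.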
25Hierarchies of variation
- Measurements from empirical sources are nearly
always subject to some form of variability.
- The lowest level in the hierarchy is
observational variability: an observation is made
on the same entity several times in exactly the
same way, and those observations are seen to
vary.
- The magnitude of observational variability may be
zero for discrete variable types, but may be
considerable for continuous variables.
- The next level up is within-entity variability:
the same entity is repeatedly measured, but we
vary the way in which it is measured.
- Within-sample variability is where different
entities from the same sample are measured (such as
the composition of different fragments from the same
pane of glass). Again this may be zero for
discrete variable types.
- Between-sample variability, e.g. THC levels in
marijuana seizures in 1986 and 1987.
- These stages in the hierarchy of variation tend
to be additive.
26Matlab Practicals