YY Teo - PowerPoint PPT Presentation



1
Statistics for Medical Sciences Division
  • YY Teo

Summer 07
2
Resources
  • Lectures
  • Lecture Notes
  • Ramsey and Schafer (2002) The Statistical Sleuth
    (2nd Edition) Thomson Learning
  • Email: teo_at_stats.ox.ac.uk

3
Lesson Structure
  • 5 lectures and practical sessions (1.5 hr)
  • Overview of statistics, exploratory data analysis
  • Z-tests, t-tests, Analysis of Variance (ANOVA),
    post-hoc analysis
  • Categorical analysis, odds ratio
  • Linear regression
  • Logistic regression

4
Overview of Statistics
5
Scientific Process
6
Analysing a set of data
  • Look at the data (initial checks on the data)
    • Downloading data, formatting, data collection,
      discrepant data, missing data
  • Visualize the data (exploratory data analysis)
    • Descriptive statistics, informative tables,
      well-constructed figures
  • Analyse the data (definitive analysis)
    • Formal statistical analysis
    • Quantify any interesting results
  • Report the findings

7
Computers and Statistics
  • Excel, SPSS, Minitab, Stata, MATLAB, R, etc.
  • Advantages
    • Speed, accuracy, ease of data manipulation
    • Easy to produce plots, cross-tabulation tables,
      summary statistics
  • Disadvantages
    • Inappropriate analysis / use of wrong tests
    • Data dredging

8
Sample Selection
  • Simple Random Sample
  • Stratified Sample
  • Cluster Sampling
  • Multistage Sampling
  • Multi-phase Sampling

9
Types of Variables
  • Often, the test to use depends on the type of
    variable at hand
  • Two main classes of variables
    • Categorical
    • Numerical
  • Categorical variables are further divided into two
    sub-classes
    • Nominal categorical (example: gender, ethnic
      groups)
    • Ordinal categorical (example: size of a car,
      quality of teaching)

10
Numerical Variables
  • Distinguish between discrete and continuous
    numerical variables
  • Discrete
    • Integer values (number of male subjects, number
      of episodes of flu outbreaks)
  • Continuous
    • Takes a whole range of values (height, weight)
    • Continuous variables are sometimes treated as
      discrete (age)

11
Exploratory Data Analysis
12
EDA
  • Tabular EDA
    • Univariate tables, cross-tabulation of
      categorical variables
  • Numerical EDA
    • Location, spread, skewness, covariance and
      correlation
  • Graphical EDA
    • Frequency plots, histograms, boxplots,
      scatterplots
  • The precise form of EDA depends on the data at
    hand.

13
Tabular EDA
  • Useful for summarising categorical data.

For example, the following table shows the
classification of 2,555 DNA samples in a
case-control study of malaria onset into the
respective phenotypes.
14
Tabular EDA
  • For two categorical variables, e.g. the
    distribution of genotypes for a sub-sample of the
    individuals affected and unaffected by severe
    malaria

Question: There appear to be more affected individuals
with the C allele (or conversely, more unaffected
individuals with the AA genotype). Does this mean
anything? (See later for the test of independence.)
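
A cross-tabulation of two categorical variables can be built by counting each combination of levels. The following is a minimal sketch only: the records are invented for illustration and are not the counts from the malaria study, and the function name cross_tabulate is hypothetical.

```python
from collections import Counter

# Hypothetical (status, genotype) records, invented purely for illustration.
# They are NOT the counts from the malaria case-control study.
records = [
    ("affected", "AC"), ("affected", "CC"), ("affected", "AA"),
    ("affected", "AC"), ("unaffected", "AA"), ("unaffected", "AA"),
    ("unaffected", "AC"), ("unaffected", "AA"),
]


def cross_tabulate(pairs):
    """Count how often each (row level, column level) combination occurs."""
    counts = Counter(pairs)
    rows = sorted({row for row, _ in pairs})
    cols = sorted({col for _, col in pairs})
    # Nested dict: table[row][column] -> count (0 if the pair never occurs).
    return {row: {col: counts[(row, col)] for col in cols} for row in rows}


table = cross_tabulate(records)
for status, genotype_counts in table.items():
    print(status, genotype_counts)
```

Whether any difference between the rows is meaningful is a separate question, which a formal test of independence addresses.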
15
Numerical EDA
  • Location (mean, median, mode)
  • Spread (range, standard deviation, interquartile
    range)
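
As a rough illustration of these summaries, the sketch below uses Python's standard statistics module on a small data set. The values are invented for illustration, and the sample standard deviation (divisor n - 1) is assumed.

```python
import statistics

# Hypothetical measurements, invented for illustration only.
data = [2, 4, 4, 5, 7, 9, 10, 12, 15]

location = {
    "mean": statistics.mean(data),
    "median": statistics.median(data),
    "mode": statistics.mode(data),  # most frequent value
}
spread = {
    "range": max(data) - min(data),
    "standard deviation": statistics.stdev(data),  # sample SD, divisor n - 1
}

print(location)
print(spread)
```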

16
Numerical EDA
  • Sample quartiles
    • Q1: 25th percentile (the value 25% of the way
      through the ranked data)
    • Q2: 50th percentile (also known as the median
      of the data)
    • Q3: 75th percentile (the value 75% of the way
      through the ranked data)
  • Interquartile range (IQR): IQR = Q3 - Q1
  • Minimum, Maximum of data
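
The quartiles and the IQR can be computed as in the sketch below, assuming Python 3.8 or later for statistics.quantiles. The data are invented for illustration. Software packages use slightly different interpolation rules for quartiles, so the "inclusive" method chosen here is one convention among several.

```python
import statistics

# Hypothetical measurements, invented for illustration only.
data = [2, 4, 4, 5, 7, 9, 10, 12, 15]

# n=4 splits the ranked data into four parts and returns the three cut points.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print("Min:", min(data), "Max:", max(data))
print("Q1:", q1, "Q2 (median):", q2, "Q3:", q3)
print("IQR:", iqr)
```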

17
Numerical EDA
  • Summary numbers can help identify potential
    problems with the data
  • Example: Suppose the heights of 1,496 individuals
    randomly sampled from the population produce the
    following summary

IQR = Q3 - Q1 = 188 - 172 = 16
Range = Max - Min = 201 - 0 = 201
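
A minimum height of 0 is not plausible, so it presumably reflects a data-entry error or a code for a missing value. The sketch below reproduces the arithmetic from the summary and flags implausible extremes. The function name and the plausibility bounds (50 and 250, assuming heights in centimetres) are assumptions for illustration, not part of the lecture.

```python
def check_height_summary(minimum, q1, q3, maximum,
                         plausible_min=50, plausible_max=250):
    """Return the IQR, the range, and warnings about implausible extremes.

    The plausibility bounds are assumed values for adult heights in cm.
    """
    iqr = q3 - q1
    value_range = maximum - minimum
    warnings = []
    if minimum < plausible_min:
        warnings.append(f"minimum {minimum} is below {plausible_min}")
    if maximum > plausible_max:
        warnings.append(f"maximum {maximum} is above {plausible_max}")
    return iqr, value_range, warnings


# Summary values taken from the example above.
iqr, value_range, warnings = check_height_summary(
    minimum=0, q1=172, q3=188, maximum=201
)
print("IQR:", iqr)
print("Range:", value_range)
print("Warnings:", warnings)
```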
18
Numerical EDA
  • Skewness
  • The skewness coefficient b1 is unit-free; a value
    of 0 indicates symmetry about the sample mean.
    Positive values indicate right-skewness, and
    negative values indicate left-skewness.
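
The formula for b1 did not survive in this transcript. As a sketch under that caveat, the code below uses one common moment-based definition (third central moment divided by the second central moment raised to the power 3/2), which may differ from the lecture's exact formula by a sample-size correction. The data are invented for illustration.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5, using divisor n for both moments.

    This is one common definition, assumed here for illustration.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical data sets, invented for illustration only.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 10]
left_skewed = [-10, -2, -2, -1, -1]

print(skewness(symmetric))     # zero for data symmetric about the mean
print(skewness(right_skewed))  # positive: long tail to the right
print(skewness(left_skewed))   # negative: long tail to the left
```

Because both moments are in powers of the original units and the ratio cancels them, the result is unit-free.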

19
Covariance and Correlation
  • Two numerical variables: height and weight
  • Questions
    • Is there any relationship between these
      variables?
    • If there is, how do we quantify this
      relationship?
  • Covariance and correlation measure the degree
    of association between two numerical variables.

20
Covariance and Correlation
  • Covariance is scale-dependent, and correlation
    is unit-free.
  • More intuitive to interpret correlation than
    covariance.
  • Example: Covariance for height and weight is 2.4
    when assessed using metres and kilograms, but
    240,000 when assessed using centimetres and
    grams. Correlation is the same, 0.83, in both
    scenarios.
  • Correlation is always bounded between -1 and 1
    inclusive.
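
The scale-dependence can be checked numerically. Converting metres to centimetres multiplies heights by 100 and converting kilograms to grams multiplies weights by 1,000, so the covariance is multiplied by 100 x 1,000 = 100,000 while the correlation is unchanged. The sketch below assumes the sample covariance (divisor n - 1) and uses invented data, not the data behind the 2.4 and 0.83 figures.

```python
import math


def covariance(xs, ys):
    """Sample covariance with divisor n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


# Hypothetical heights (m) and weights (kg), invented for illustration only.
heights_m = [1.60, 1.65, 1.70, 1.75, 1.80, 1.85]
weights_kg = [55.0, 62.0, 64.0, 72.0, 75.0, 85.0]

# The same measurements expressed in centimetres and grams.
heights_cm = [h * 100 for h in heights_m]
weights_g = [w * 1000 for w in weights_kg]

cov_m_kg = covariance(heights_m, weights_kg)
cov_cm_g = covariance(heights_cm, weights_g)
corr_m_kg = correlation(heights_m, weights_kg)
corr_cm_g = correlation(heights_cm, weights_g)

print("Covariance ratio:", cov_cm_g / cov_m_kg)
print("Correlations:", corr_m_kg, corr_cm_g)
```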

21
Example
22
Graphical EDA
  • Visual summaries of the data
  • Flag outliers, spot obvious relationships, check
    the shape of distributions

23
Boxplots
  • Univariate boxplot for 1 numerical variable

  • Ends of box: Q1 and Q3
  • Length of box: IQR
  • White line: sample median
  • Whiskers: 1.5 times IQR
  • Lines outside whiskers: outliers
  • Circles: extreme outliers
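
The boxplot rules can be expressed as a simple classification, sketched below. It assumes the whiskers reach at most 1.5 times the IQR beyond the box, and it assumes that "extreme" means more than 3 times the IQR beyond the box, a common convention that the transcript does not state. The function name classify is hypothetical. The quartiles 172 and 188 come from the height example in this lecture.

```python
def classify(value, q1, q3, whisker=1.5, extreme=3.0):
    """Classify a value relative to boxplot fences.

    whisker=1.5 follows the slide; extreme=3.0 is an assumed convention.
    """
    iqr = q3 - q1
    if value < q1 - extreme * iqr or value > q3 + extreme * iqr:
        return "extreme outlier"
    if value < q1 - whisker * iqr or value > q3 + whisker * iqr:
        return "outlier"
    return "within whiskers"


# Quartiles from the height example: Q1 = 172, Q3 = 188, so IQR = 16.
# Whisker fences are 148 and 212; extreme fences are 124 and 236.
for height in (0, 140, 201):
    print(height, classify(height, q1=172, q3=188))
```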
24
Boxplots
  • Multivariate boxplots for 1 numerical variable
    across different levels of a categorical variable
  • Graphical comparison

25
Scatterplots
  • Graphical representation for 2 numerical
    variables

26
Scatterplots
27
Scatterplots