Statistics in MATLAB - PowerPoint PPT Presentation

1
Statistics in MATLAB
  • COMM2M
  • Harry R. Erwin, PhD
  • University of Sunderland

2
Resources
  • http://www.mathworks.com/access/helpdesk/help/pdf_doc/stats/stats.pdf
    This can be found in the COMM2M Lectures folder
    as STATS.PDF.
  • Higham and Higham, 2000, MATLAB Guide, SIAM.
  • James E. Gentle, 2002, Elements of Computational
    Statistics, Springer.
  • Wendy L. Martinez and Angel R. Martinez, 2002,
    Computational Statistics Handbook with MATLAB,
    Chapman & Hall/CRC.
  • Michael J. Crawley, 2005, Statistics: An
    Introduction Using R, Wiley. Our Statistics Study
    Group is working through this.

3
Doing Computational Statistics
  • Usually you do computational statistics to
    explore the structure of data. The questions you
    might ask are rather open-ended. Your
    understanding is facilitated by a model.
  • A model embodies what you currently know about
    the data. You can formulate it either as a
    data-generating process or a set of rules for
    processing the data.

4
Statistical Models
  • Often expressed as a set of equations relating
    data elements.
  • Can include probability distributions for the
    elements. If this is the case, you have a
    stochastic model.
  • The model should be free to evolve based on data
    mining.

5
Common Stochastic Models
  • Parameterized statistical distributions, such as
    the normal distribution, binomial distribution,
    or the chi-squared distribution.
  • Sometimes more complicated, where you need to use
    simulation, resampling, and visualization to
    determine the parameters of the model.

6
Structure-in-the-data
  • Of most interest, for example:
      • Modes
      • Gaps
      • Clusters
      • Symmetry
      • Shape
      • Deviations from normality

7
Visualization
  • Multiple views are necessary
  • Be able to zoom in on the data, as a few points
    can obscure the interesting structure.
  • Scaling of the axes may be necessary, since our
    eyes are not perfect tools for detecting
    structure.
  • Watch out for time-ordered or location-ordered
    data, particularly if time or location are not
    explicitly reported.

8
Plots
  • Use simple plots to start with.
  • Watch for rounded data, shown by horizontal
    strata in the data. That often signals other
    problems.

9
Statistical Activities
  • Data collection (ideally the statistician has a
    say in how the data are collected)
  • Description of a dataset
      • Averages
      • Spreads
      • Extreme points
  • Inference within a model or collection of models
  • Model selection

10
How to Do It
  • Start by determining what sort of statistical
    analysis you should do. You need to know:
      • Which variable is the response variable?
      • Which are the explanatory variables?
      • What kind are the explanatory variables?
      • What kind of response variable do you have?

11
Basic Method of Analysis
  • If all explanatory variables are continuous, plan
    on a regression analysis.
  • If all explanatory variables are categorical,
    plan for an analysis of variance (ANOVA).
  • If you have a mix, plan for an analysis of
    covariance (ANCOVA).

12
Effect of Response Variable
  • If the response variable is continuous, then plan
    on a normal regression, ANOVA, or ANCOVA.
  • If the response variable is a proportion, do a
    logistic regression.
  • If a count, you need a log-linear model.
  • If binary, you need a binary logistic analysis.
  • If time to event or time at death, you will be
    doing a survival analysis.
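
The planning rules on this slide and the previous one can be written down as a small lookup. The sketch below is in Python rather than MATLAB, and the function name `plan_analysis` and the string labels are hypothetical choices made for illustration only.

```python
# Hypothetical labels for this sketch; the slides do not define a coding scheme.
RESPONSE_MODELS = {
    "continuous": "normal regression, ANOVA, or ANCOVA",
    "proportion": "logistic regression",
    "count": "log-linear model",
    "binary": "binary logistic analysis",
    "time to event": "survival analysis",
}


def plan_analysis(explanatory_kinds, response_kind):
    """Return (method family, model) following the two planning slides.

    explanatory_kinds: iterable of "continuous" / "categorical" labels.
    response_kind: one of the keys of RESPONSE_MODELS.
    """
    kinds = set(explanatory_kinds)
    if not kinds:
        raise ValueError("need at least one explanatory variable")
    unknown = kinds - {"continuous", "categorical"}
    if unknown:
        raise ValueError(f"unknown explanatory kind(s): {sorted(unknown)}")
    if response_kind not in RESPONSE_MODELS:
        raise ValueError(f"unknown response kind: {response_kind!r}")

    # All continuous -> regression, all categorical -> ANOVA, a mix -> ANCOVA.
    if kinds == {"continuous"}:
        method = "regression"
    elif kinds == {"categorical"}:
        method = "ANOVA"
    else:
        method = "ANCOVA"
    return method, RESPONSE_MODELS[response_kind]


print(plan_analysis(["continuous", "categorical"], "count"))
```

The point of the sketch is only that the choice is mechanical once the four questions about the variables have been answered.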

13
Variation
  • You want to understand how the response is
    dependent on variation in the explanatory
    variables, but you are also interested in lack of
    dependence.
  • Design the simplest model that explains the data
    adequately.

14
Significance
  • You have to determine what the probability of a
    false alarm will be, that is, that you will think
    something is significant when it really isn't.
  • Typical values are 5%, 1%, and 0.1%.
  • Don't test every hypothesis. Some will appear
    significant by chance.
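
Why testing every hypothesis is dangerous can be seen with a little arithmetic. The sketch below (Python, with a hypothetical function name) assumes the tests are independent and that every null hypothesis is actually true, then computes the chance of at least one false alarm.

```python
def familywise_false_alarm(alpha, n_tests):
    """Chance of at least one false alarm among n_tests independent tests,
    each run at significance level alpha, when every null hypothesis is true."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be a probability")
    if n_tests < 0:
        raise ValueError("n_tests must be non-negative")
    # Each test stays quiet with probability (1 - alpha); all must stay quiet.
    return 1.0 - (1.0 - alpha) ** n_tests


for n in (1, 5, 20, 100):
    print(n, round(familywise_false_alarm(0.05, n), 3))
```

Under these assumptions, 20 tests at the 5% level give roughly a 64% chance of at least one spurious "significant" result.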

15
Good and Bad Hypotheses
  • There are vultures in the local park.
  • There are no vultures in the local park.
  • Which is testable?
  • The null hypothesis is testable. You test it by
    taking measurements and showing that if the null
    hypothesis is true, the chance of those
    measurements is nearly zero.

16
Experimental Design
  • Replication
      • Increases reliability, so be thorough. The
        usual answer to "how many replicates?" is 30.
  • Randomization
      • Reduces bias, so do it properly.
      • Almost never done properly.
  • Discuss
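
One way to randomize properly is to let a random number generator, not the experimenter, decide which unit gets which treatment. A minimal sketch in Python follows; the function name, the plot labels, and the two treatment names are all hypothetical.

```python
import random


def randomize_assignment(units, treatments, seed=None):
    """Randomly assign units to treatments in (near-)equal group sizes."""
    if not treatments:
        raise ValueError("need at least one treatment")
    rng = random.Random(seed)  # a fixed seed makes the assignment reproducible
    shuffled = list(units)
    rng.shuffle(shuffled)
    groups = {t: [] for t in treatments}
    # Deal the shuffled units out to the treatments in turn.
    for i, unit in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(unit)
    return groups


plots = [f"plot{i:02d}" for i in range(1, 13)]
assignment = randomize_assignment(plots, ["control", "treatment"], seed=1)
for name, members in assignment.items():
    print(name, sorted(members))
```

Recording the seed alongside the data lets someone else check that the assignment really was random.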

17
Controls
  • No controls, no conclusions.
  • A control experiment is one where you don't apply
    the treatment or don't enable the part of your
    experiment that is supposed to produce the
    different outcome.

18
Replication
  • Must be independent
  • Not part of a time series
  • Not grouped together in space
  • Of an appropriate spatial scale
  • Covers the normal variation in initial conditions.

19
Error Types

20
Typical α and β values
  • You usually want the probability of rejecting the
    null hypothesis when it is true (α) to be less
    than 5%.
  • You usually want the probability of accepting the
    null hypothesis when it is false (β) to be less
    than 20%.
  • The power of a test is 1 − β, or greater than 80%
    in this case.
  • Rule of thumb: the number of replicates needed to
    reject the null hypothesis with probability 80% is
    about 8s²/d², where s² is the variance in the
    response and d is the size of the difference to
    be detected in a single sample.
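
The rule of thumb n ≈ 8s²/d² is easy to turn into a calculator. This is a sketch in Python with a hypothetical function name; it rounds up because you cannot run a fraction of a replicate.

```python
import math


def replicates_needed(variance, difference):
    """Rule-of-thumb replicate count n = 8 * s^2 / d^2 (about 80% power, 5% level).

    variance: s^2, the variance of the response.
    difference: d, the size of the difference you want to detect.
    """
    if variance < 0:
        raise ValueError("variance must be non-negative")
    if difference == 0:
        raise ValueError("difference must be non-zero")
    return math.ceil(8.0 * variance / difference ** 2)


# Variance 4 and a difference of 2 to detect: 8 * 4 / 4 = 8 replicates.
print(replicates_needed(variance=4.0, difference=2.0))
```

Halving the difference to be detected quadruples the number of replicates, which is usually what drives the cost of an experiment.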

21
Inference
  • Strong inference
      • A clear hypothesis
      • An acceptable test
  • Weak inference
      • Natural experiments
      • Conclusions from natural experiments are
        hypotheses.

22
How Long to Go On?
  • To stop the experiment as soon as a pleasing
    result is obtained?
  • To keep going until the theoretically correct
    result is obtained?
  • Discuss.

23
Statistics in MATLAB
  • MATLAB has some useful statistical tools you can
    use to do all this (although most computational
    statistics is done using FORTRAN, SAS, R, or
    S-Plus).
  • Supports the usual range of statistical tasks,
    including both analysis and visualization.
  • Following is an overview of the capabilities of
    the MATLAB statistics toolbox.

24
Statistics Capabilities
  • Probability distributions
  • Descriptive statistics
  • Linear and non-linear models
  • Hypothesis testing
  • Multivariate statistics
  • Plotting
  • Statistical process control
  • Design of experiments
  • Hidden Markov models.

25
Random number generators
  • There are functions in the Statistics Toolbox
    that return random output.
  • These allow the user to observe probability
    distributions, evaluate statistical tests, and
    use resampling techniques.

26
Probability distributions
  • These are used to display possible probability
    distributions and create histograms.
  • MATLAB provides the pdf, cdf, inverse cdf, a
    random number generator, and mean and variance
    estimators for each distribution.
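
The same five pieces (pdf, cdf, inverse cdf, random numbers, mean and variance) can be seen side by side for the normal distribution. The sketch uses Python's standard-library `statistics.NormalDist` as a stand-in; it is not the toolbox interface, whose function names differ.

```python
from statistics import NormalDist

dist = NormalDist(mu=0.0, sigma=1.0)     # standard normal

density = dist.pdf(0.0)                  # pdf: height of the curve at x = 0
prob = dist.cdf(1.96)                    # cdf: P(X <= 1.96)
quantile = dist.inv_cdf(0.975)           # inverse cdf: x with P(X <= x) = 0.975
draws = dist.samples(1000, seed=1)       # random number generator
fitted = NormalDist.from_samples(draws)  # mean and spread estimated from data

print(f"pdf(0) = {density:.4f}")
print(f"cdf(1.96) = {prob:.4f}")
print(f"inv_cdf(0.975) = {quantile:.4f}")
print(f"mean {dist.mean}, variance {dist.variance}")
print(f"estimated mean {fitted.mean:.3f}, estimated sd {fitted.stdev:.3f}")
```

Note that the cdf and the inverse cdf undo each other, which is how a p-value is turned into a critical value and back.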

27
Continuous Distributions Provided
  • Beta
  • Exponential
  • Extreme value
  • Gamma
  • Lognormal
  • Normal
  • Rayleigh
  • Uniform
  • Weibull

28
Continuous Statistical Distributions
  • Chi-square
  • Non-central Chi-square
  • F
  • Non-central F
  • t
  • Non-central t

29
Discrete distributions
  • Binomial
  • Discrete uniform
  • Geometric
  • Hypergeometric
  • Negative binomial
  • Poisson

30
Descriptive statistics
  • mean
  • median
  • variance
  • standard deviation
  • Grouped data
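
A minimal sketch of these descriptive statistics, including a grouped summary, in Python's standard library. The function names and the tiny dataset are made up for illustration; the variance is the sample variance (divisor n − 1).

```python
import statistics
from collections import defaultdict


def describe(values):
    """Mean, median, sample variance and sample standard deviation."""
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "variance": statistics.variance(values),
        "std": statistics.stdev(values),
    }


def describe_by_group(records):
    """Summarise (group label, value) pairs separately for each group."""
    groups = defaultdict(list)
    for label, value in records:
        groups[label].append(value)
    return {label: describe(values) for label, values in groups.items()}


# Made-up (group, measurement) pairs.
records = [("a", 2.0), ("a", 4.0), ("a", 6.0), ("b", 1.0), ("b", 3.0)]
summary = describe_by_group(records)
for label, stats in summary.items():
    print(label, stats)
```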

31
Linear and non-linear models
  • ANOVA
  • Covariance analysis (ANCOVA)
  • Multiple linear regression
  • Quadratic response surface models
  • Stepwise regression
  • GLM
  • Robust and nonparametric methods
  • Nonlinear least squares
  • Regression and Classification Trees (CART)
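
The simplest model on this list is a straight line fitted by least squares. The sketch below, in plain Python with a hypothetical function name and made-up data, covers only the one-predictor case; multiple regression and the other models need a proper toolkit.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - residual sum of squares / total sum of squares.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else 1.0
    return slope, intercept, r_squared


# Made-up data lying close to y = 1 + 2x.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 7.1, 8.8]
slope, intercept, r_squared = fit_line(x, y)
print(f"slope {slope:.3f}, intercept {intercept:.3f}, R^2 {r_squared:.4f}")
```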

32
Hypothesis testing
  • Null hypothesis
  • Alternative hypotheses
  • Significance level
  • p-value
  • Confidence intervals
  • A number of tests are provided (this is a hard
    area)
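
To make the vocabulary concrete, here is one test worked through by resampling: a two-sided permutation test for a difference in means. It is a sketch in Python, the function name is hypothetical, the data are made up, and it is not one of the toolbox's built-in tests.

```python
import random
import statistics


def permutation_test(a, b, n_permutations=5000, seed=0):
    """Two-sided permutation p-value for a difference in group means.

    Null hypothesis: the group labels are irrelevant, so shuffling them
    should produce differences as large as the observed one fairly often.
    """
    rng = random.Random(seed)
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    n_a = len(a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add one to numerator and denominator so the p-value is never exactly 0.
    return (extreme + 1) / (n_permutations + 1)


# Made-up measurements for two groups.
control = [4.1, 5.0, 4.7, 5.3, 4.4, 4.9]
treated = [6.2, 5.9, 6.8, 6.1, 7.0, 6.5]

alpha = 0.05                                  # significance level
p_value = permutation_test(control, treated)
print(f"p = {p_value:.4f}; reject the null hypothesis: {p_value < alpha}")
```

The p-value is the fraction of shuffles at least as extreme as the data; comparing it with the chosen significance level gives the decision.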

33
Multivariate statistics
  • Principal components analysis
  • Factor analysis
  • MANOVA
  • Cluster analysis
  • Multidimensional scaling

34
Plotting and Visualization
  • Box plots
  • Distribution plots
  • Scatter plots

35
Statistical process control
  • Quality of manufactured goods
  • Control charts
  • Capability studies
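
A control chart boils down to a centre line and limits computed from in-control data, plus a check of new measurements against them. The Python sketch below is a simplified version under stated assumptions: it uses the plain sample standard deviation and 3-sigma limits, whereas production charts usually estimate spread from subgroups or moving ranges. Function names and data are hypothetical.

```python
import statistics


def control_limits(baseline, n_sigma=3.0):
    """(lower limit, centre line, upper limit) from in-control baseline data."""
    center = statistics.fmean(baseline)
    sigma = statistics.stdev(baseline)
    return center - n_sigma * sigma, center, center + n_sigma * sigma


def out_of_control(measurements, limits):
    """Indices of measurements falling outside the control limits."""
    lower, _, upper = limits
    return [i for i, m in enumerate(measurements) if m < lower or m > upper]


# Made-up baseline and new measurements.
baseline = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7]
new_measurements = [10.1, 9.9, 11.0, 10.2]

limits = control_limits(baseline)
print("limits:", tuple(round(v, 2) for v in limits))
print("flagged indices:", out_of_control(new_measurements, limits))
```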

36
Design of experiments
  • Full factorial designs
  • Fractional factorial designs
  • Response surface designs
  • D-optimal designs
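
A full factorial design is simply every combination of every factor level. A sketch in Python using `itertools.product`; the function name and the three factors are hypothetical, and fractional or D-optimal designs need more machinery than this.

```python
from itertools import product


def full_factorial(factors):
    """All combinations of factor levels, one dict per experimental run.

    factors: dict mapping factor name -> list of levels.
    """
    names = list(factors)
    return [
        dict(zip(names, levels))
        for levels in product(*(factors[name] for name in names))
    ]


# Hypothetical factors: 2 x 3 x 2 levels.
factors = {
    "temperature": [20, 30],
    "pH": [6, 7, 8],
    "stirring": ["off", "on"],
}
design = full_factorial(factors)
print(len(design), "runs")
for run in design[:3]:
    print(run)
```

The run count is the product of the numbers of levels, which is why fractional designs become attractive as factors are added.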

37
Hidden Markov Models
  • Concepts
  • Markov chains
  • Analysis of hidden Markov models (HMMs).
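
One standard analysis of an HMM is computing the probability of an observed sequence with the forward algorithm. The sketch is in plain Python with a hypothetical function name and a made-up two-state model (a fair and a biased coin); it is not the toolbox interface.

```python
def forward_likelihood(observations, initial, transition, emission):
    """P(observations) under a discrete HMM, via the forward algorithm.

    initial[s]: probability of starting in state s.
    transition[r][s]: probability of moving from state r to state s.
    emission[s][o]: probability of observing o while in state s.
    """
    if not observations:
        return 1.0
    states = list(initial)
    # alpha[s] = P(observations so far, current hidden state = s)
    alpha = {s: initial[s] * emission[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: emission[s][obs] * sum(alpha[r] * transition[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


# Made-up model: the hidden state is which coin is in use.
initial = {"fair": 0.5, "biased": 0.5}
transition = {
    "fair": {"fair": 0.9, "biased": 0.1},
    "biased": {"fair": 0.2, "biased": 0.8},
}
emission = {
    "fair": {"H": 0.5, "T": 0.5},
    "biased": {"H": 0.9, "T": 0.1},
}

print(forward_likelihood(list("HHTH"), initial, transition, emission))
```

The hidden states are summed out step by step, so the cost grows linearly with sequence length rather than exponentially.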

38
Conclusions
  • MATLAB provides a basic engineering toolkit for
    these statistical activities.
  • Not as broad as R or S-Plus, but compatible with
    data collected or generated by other toolkits.
  • Supports all activities well.
  • More specialized work (e.g., Bayesian analysis)
    requires either your own extensions or more
    specialized toolkits.