1
R: An Open Source Statistical Environment
  • Valentin Todorov
  • UNIDO
  • v.todorov_at_unido.org

MSIS 2008 (Luxembourg, 7-9 April 2008)
2
Outline
  • Introduction: the R Platform and Availability
  • R Learning Curve (is R hard to learn?)
  • R Extensibility (R Packages)
  • R and the others (Interfaces)
  • R Graphics
  • R for Time series
  • R for Survey Analysis
  • R and the Outliers (Robust Statistics in R)
  • More R features (WEB, Missing data, OOP, GUI)
  • Summary and Conclusions

3
What is R?
  • R is a system for statistical computation and
    graphics. It provides, among other things, a
    programming language, high-level graphics,
    interfaces to other languages and debugging
    facilities.
  • Modelled on the S language and environment
  • S was developed at Bell Labs (John Chambers et
    al.)
  • S-Plus: a value-added implementation of the S
    language (Insightful Corporation)
  • Much code written for S runs unaltered under R
  • Significantly influenced by Scheme, a Lisp
    dialect

4
What is R?
  • Ihaka and Gentleman, University of Auckland (New
    Zealand)
  • 1993: a preliminary version of R
  • 1995: released under the GNU General Public
    License
  • Now: the R core team consists of 17 members,
    including John Chambers
  • R provides a wide variety of statistical (linear
    and non-linear modelling, classical statistical
    tests, time-series analysis, classification,
    clustering, robust methods and many more) and
    graphical techniques
  • R is available as Free Software under the terms
    of the GNU General Public License (GPL).

5
R Extensibility (R Packages)
  • One of the most important features of R is its
    extensibility by creating packages of functions
    and data.
  • The R package system provides a framework for
    developing, documenting, and testing extension
    code.
  • Packages can include R code, documentation, data
    and foreign code written in C or Fortran.
  • Packages are distributed through the CRAN
    repository, http://cran.r-project.org, which
    currently holds more than 1300 packages covering a
    wide variety of statistical methods and algorithms.
    Base and recommended packages are included in
    all binary distributions.
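
As a minimal sketch of what this looks like at the R prompt (assuming a standard R installation; the package named in the commented-out install.packages() call is only an example), the base and recommended packages can be listed and a package attached as follows:

```r
# List the installed packages by priority ("base" or "recommended").
ip <- installed.packages()
base_pkgs        <- rownames(ip)[ip[, "Priority"] %in% "base"]
recommended_pkgs <- rownames(ip)[ip[, "Priority"] %in% "recommended"]
print(base_pkgs)
print(recommended_pkgs)

# Installing an add-on package from CRAN needs network access,
# so the call is left commented out here.
# install.packages("robustbase")

# Attach an installed package and look at some of the objects it exports.
library(stats)
head(ls("package:stats"))
```

The exact lists depend on the R version and on how R was installed.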

6
R and the Others (R Interfaces)
  • Reading and writing data (text files, XML,
    spreadsheet-like data, e.g. Excel)
  • Read and write data formats of SAS, S-Plus, SPSS,
    STATA, Systat, Octave: package foreign
  • Emulation of Matlab: package matlab
  • Communication with RDBMS: ROracle, RMySQL,
    RSQLite, RmSQL, RPgSQL, RODBC (large data sets,
    concurrency)
  • Package filehash: a simple key-value style
    database; the data are stored on disk but are
    handled like data sets
  • Can use compiled native code in C, C++, Fortran,
    Java

7
R Graphics
  • One of the most important strengths of R: simple
    exploratory graphics as well as well-designed
    publication-quality plots.
  • The graphics can include mathematical symbols and
    formulae where needed.
  • Can produce graphics in many formats:
  • On screen
  • PS and PDF for including in LaTeX and pdfLaTeX or
    for distribution
  • PNG or JPEG for the Web
  • On Windows, metafiles for Word, PowerPoint, etc.
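
The points above can be sketched in a few lines of base R. This is only an illustration: the file names are hypothetical, the plotted curve is an arbitrary choice, and the title shows a formula typeset through R's plotmath expressions.

```r
# Draw the same plot on two file devices: PDF and PostScript.
x <- seq(-3, 3, length.out = 200)

draw_density <- function() {
  plot(x, dnorm(x), type = "l", xlab = "x", ylab = "density",
       # Mathematical annotation in the title (plotmath expression).
       main = expression(f(x) == frac(1, sqrt(2 * pi)) * e^{-x^2 / 2}))
}

pdf_file <- file.path(tempdir(), "density.pdf")  # hypothetical file name
pdf(pdf_file, width = 6, height = 4)
draw_density()
dev.off()

ps_file <- file.path(tempdir(), "density.ps")    # hypothetical file name
postscript(ps_file, width = 6, height = 4)
draw_density()
dev.off()

# png() and jpeg() are used in the same way for Web output,
# where the R build supports those devices.
```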

8
R Graphics: basic and multipanel plots (trellis)
9
R Graphics: parallel plot and coplot
10
R for Time Series
  • Package stats:
  • classical time series modeling tools: arima()
    for Box-Jenkins type analysis
  • structural time series: StructTS()
  • filtering and decomposition: decompose() and
    HoltWinters()
  • Package forecast: additional forecast methods
    and graphical tools
  • Analyzing monthly or lower frequency time series:
  • TRAMO/SEATS
  • X-12-ARIMA
  • accessible through the Gretl library
  • Task View Econometrics:
    http://cran.r-project.org/web/views/Econometrics.html
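
A minimal sketch of the stats functions named above, using data sets that ship with R (co2 and UKgas). The choice of data and the 12-month forecast horizon are assumptions made for illustration only.

```r
# co2: monthly atmospheric CO2 concentrations (package datasets).
dec <- decompose(co2)            # classical trend + seasonal + random split
plot(dec)

hw <- HoltWinters(co2)           # exponential smoothing with trend and season
fc <- predict(hw, n.ahead = 12)  # forecast the next 12 months
plot(hw, fc)                     # observed, fitted and predicted values

# UKgas: quarterly UK gas consumption; fit a basic structural model.
st <- StructTS(log10(UKgas), type = "BSM")
print(st)
```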

11
R for Time Series: Example
  • Fitting an ARIMA model to a univariate time
    series with arima() and using tsdiag() for
    plotting time series analysis diagnostics
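
The figure from this slide is not part of the transcript. A sketch of the kind of call it describes could look like the following; the data set (lh, shipped with R) and the AR(1) order are assumptions, not necessarily what the presenter used.

```r
# lh: luteinizing hormone series, 48 observations (package datasets).
fit <- arima(lh, order = c(1, 0, 0))   # AR(1) model with a mean term
print(fit)

# Diagnostic plots: standardized residuals, residual ACF,
# and p-values of the Ljung-Box statistic.
tsdiag(fit)

# Forecast six steps ahead, with standard errors.
pred <- predict(fit, n.ahead = 6)
pred$pred
pred$se
```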

12
R for Survey Analysis
  • Complex survey samples are usually analysed by
    specialized software packages: SUDAAN, Bascula 4
    (Statistics Netherlands), etc.
  • STATA provides much more comprehensive support
    for analysing survey data than SAS and SPSS and
    could successfully compete with the specialized
    packages

13
R for Survey Analysis
  • R package survey:
    http://faculty.washington.edu/tlumley/survey/
  • Designs: stratification, clustering, possibly
    multistage sampling, unequal sampling
    probabilities or weights; multistage stratified
    random sampling with or without replacement
  • Summary statistics: means, totals, ratios,
    quantiles, contingency tables, regression models,
    for the whole sample and for domains
  • Variances by Taylor linearization or by replicate
    weights (BRR, jack-knife, bootstrap, or
    user-supplied)
  • Graphics: histograms, hexbin scatterplots,
    smoothers
  • Other packages: pps, sampling, sampfling
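
As a hedged sketch of the workflow listed above, the following uses the api example data that ship with the survey package. It assumes the package is installed (the code is skipped otherwise), and the choice of design and variables is for illustration only.

```r
if (requireNamespace("survey", quietly = TRUE)) {
  library(survey)
  data(api)   # California school samples shipped with the survey package

  # One-stage cluster sample: school districts (dnum) are the clusters.
  dclus1 <- svydesign(id = ~dnum, weights = ~pw, fpc = ~fpc, data = apiclus1)

  api_mean  <- svymean(~api00, dclus1)                 # Taylor linearization
  enrol_tot <- svytotal(~enroll, dclus1)               # estimated total
  by_type   <- svyby(~api00, ~stype, dclus1, svymean)  # estimates by domain

  # The same mean with variances from replicate weights.
  rclus1       <- as.svrepdesign(dclus1)
  api_mean_rep <- svymean(~api00, rclus1)

  print(api_mean)
  print(enrol_tot)
  print(by_type)
  print(api_mean_rep)
}
```

The two designs give the same point estimate; only the variance estimation method differs.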

14
R and the Outliers (Robust Statistics in R)
  • What are outliers?
  • atypical observations which are inconsistent with
    the rest of the data or deviate from the
    postulated model
  • may arise through contamination, errors in data
    gathering, or misspecification of the model
  • classical statistical methods are very sensitive
    to such data
  • What are robust methods?
  • produce reasonable results even when one or more
    outliers may appear in the data
  • Robust regression: robustbase
  • Robust multivariate methods: rrcov, robustbase
  • Robust time series analysis: robust-ts

15
R and the Outliers: Example
  • Example: Wages and Hours,
    http://lib.stat.cmu.edu/DASL/
  • a national sample of 6000 households with a male
    head earning less than $15,000 annually in 1966,
    with 9 independent variables, classified into 39
    demographic groups
  • estimate y, the labour supply (average hours),
    from the available data (for the example we will
    consider only one variable: x, the average age of
    the respondents)
  • We will fit an Ordinary Least Squares (OLS) model
    and a robust Least Trimmed Squares (LTS) model
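
The fitted-line plots from the next two slides are not part of this transcript. The contrast between the two fits can be sketched as follows, under clearly stated assumptions: the data are simulated (not the Wages and Hours data), and the LTS fit uses lqs() from the recommended package MASS. The robustbase package offers ltsReg() for the same purpose.

```r
library(MASS)   # recommended package; provides lqs()

set.seed(1)
n <- 40
x <- runif(n, 20, 60)                  # simulated predictor
y <- 10 + 0.5 * x + rnorm(n, sd = 1)   # true slope is 0.5

# Contaminate the four observations with the largest x values.
idx <- order(x, decreasing = TRUE)[1:4]
y[idx] <- y[idx] - 30

ols <- lm(y ~ x)                   # classical least squares
lts <- lqs(y ~ x, method = "lts")  # least trimmed squares

print(coef(ols))
print(coef(lts))

plot(x, y)
abline(ols, lty = 2)   # dashed line: OLS, pulled towards the outliers
abline(lts, lty = 1)   # solid line: LTS, follows the bulk of the data
```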

16
R and the Outliers: Example OLS
17
R and the Outliers: Example LTS
18
R and the Outliers: Example Covariance
  • Maronna and Yohai (1998)
  • rrcov data set: maryo
  • A bivariate data set with sample correlation
    0.81
  • interchange the largest and smallest value in the
    first coordinate
  • the sample correlation becomes 0.05
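
The same experiment can be sketched on simulated data. This does not use the maryo data set, so the numbers will differ from those on the slide. The robust estimate below comes from cov.rob() in the recommended package MASS, which is an assumption made so the example runs without add-on packages.

```r
library(MASS)   # recommended package; provides cov.rob()

set.seed(42)
n  <- 20
x1 <- rnorm(n)
x2 <- 0.8 * x1 + sqrt(1 - 0.8^2) * rnorm(n)   # population correlation 0.8

r_clean <- cor(x1, x2)

# Interchange the largest and smallest value of the first coordinate.
x1_swapped <- x1
i_max <- which.max(x1)
i_min <- which.min(x1)
x1_swapped[c(i_max, i_min)] <- x1[c(i_min, i_max)]

r_swapped <- cor(x1_swapped, x2)

# Robust correlation based on the MCD estimator.
r_robust <- cov.rob(cbind(x1_swapped, x2), method = "mcd", cor = TRUE)$cor[1, 2]

print(c(clean = r_clean, swapped = r_swapped, robust = r_robust))
```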

19
More R
  • R and the WEB: several projects that provide
    possibilities to use R over the WEB
  • R and the Missing: advanced missing value
    handling
  • mvnmle: ML estimation for multivariate data with
    missing values
  • mitools: tools for multiple imputation of missing
    data
  • mice: Multivariate Imputation by Chained
    Equations
  • EMV: Estimation of Missing Values for a Data
    Matrix
  • VIM: provides methods for the visualisation as
    well as imputation of missing data
  • R Objects: R is an object-oriented language
    (however, in a quite different sense from C++,
    Java, C#)
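
One way to see the difference in the object model is R's S3 system, where methods belong to generic functions rather than to classes. The following is a minimal sketch; the generic area() and the classes circle and rectangle are hypothetical names made up for this example.

```r
# A generic function dispatches on the class attribute of its argument.
area <- function(shape, ...) UseMethod("area")

# Methods are ordinary functions named <generic>.<class>.
area.circle    <- function(shape, ...) pi * shape$r^2
area.rectangle <- function(shape, ...) shape$w * shape$h

# Objects are plain lists with a class attribute attached.
shapes <- list(
  structure(list(r = 1), class = "circle"),
  structure(list(w = 2, h = 3), class = "rectangle")
)

areas <- vapply(shapes, area, numeric(1))
print(areas)
```

Built-in generics such as print(), summary() and plot() work the same way, which is why one call behaves sensibly on very different kinds of objects.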

20
More R
  • R GUI
  • R Commander: a basic statistics GUI, consisting
    of a window containing several menus, buttons,
    and information fields
  • SciViews: a suite of companion applications for
    Windows
  • R and SDMX
  • R Reports
  • package xtable: coerce data to LaTeX and HTML
    tables
  • package Sweave: a framework for mixing text and R
    code for automatic report generation

21
Summary
  • Output Management System
  • SAS/SPSS: it is rarely used for routine work
  • R: output is easily passed from one function to
    another to do further processing and to obtain
    more results
  • Macro Language
  • SAS/SPSS: a special language with its own syntax.
    The new functions are not run in the same way as
    the built-in procedures
  • R: R itself is a programming language
  • Matrix Language
  • SAS/SPSS: a special language with its own syntax
  • R: a vector- and matrix-based language,
    complemented by additional packages: Matrix,
    SparseM
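
The three R points on this slide can be sketched in base R as follows. The data set cars ships with R; the helper coef_var() is a hypothetical user-defined function introduced only for this example.

```r
# Output management: the result of one function is an object
# that can be passed straight to the next function.
fit  <- lm(dist ~ speed, data = cars)
est  <- coef(summary(fit))             # coefficient table as a numeric matrix
tval <- est[, "Estimate"] / est[, "Std. Error"]

# Macro language: a user-defined function is called exactly
# like a built-in one.
coef_var <- function(x) sd(x) / mean(x)   # coefficient of variation
cv <- sapply(cars, coef_var)

# Matrix language: OLS estimates from the normal equations.
X    <- model.matrix(fit)
y    <- cars$dist
beta <- solve(crossprod(X), crossprod(X, y))

print(tval)
print(cv)
print(beta)
```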

22
Summary (cont.)
  • Publishing results
  • SAS/SPSS: cut and paste to a word processor or
    export to a file
  • R: produce LaTeX output (including graphics),
    using for example the Sweave package
  • Data size
  • SAS/SPSS: limited by the size of the disk
  • R: limited by the size of the RAM; (non-trivial)
    usage of databases for large data sets is
    possible
  • Data structure
  • SAS/SPSS: rectangular data set
  • R: rectangular data frame, vector, list
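
A minimal sketch of the three R data structures named above. The values are toy data invented for the illustration.

```r
# Atomic vector: all elements have the same type.
v <- c(1.5, 2.0, 3.5)

# List: elements can have different types and lengths.
l <- list(id = 1L, tags = c("a", "b"), m = diag(2))

# Data frame: a rectangular data set, one column per variable.
df <- data.frame(region = c("N", "S", "N"), value = v)

# Group means of value by region.
means <- tapply(df$value, df$region, mean)

str(l)
print(df)
print(means)
```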

23
Summary (cont.)
  • Interface to other programming languages
  • SAS/SPSS: not available
  • R: can be easily mixed with Fortran, C, C++ and
    Java
  • Source code
  • SAS/SPSS: not available
  • R: the source code of R itself as well as of its
    packages is a part of the distribution
