Title: R: An Open Source Statistical Environment
1. R: An Open Source Statistical Environment
- Valentin Todorov
- UNIDO
- v.todorov_at_unido.org
MSIS 2008 (Luxembourg, 7-9 April 2008)
2. Outline
- Introduction: the R Platform and Availability
- R Learning Curve (is R hard to learn?)
- R Extensibility (R Packages)
- R and the Others (Interfaces)
- R Graphics
- R for Time Series
- R for Survey Analysis
- R and the Outliers (Robust Statistics in R)
- More R features (WEB, Missing data, OOP, GUI)
- Summary and Conclusions
3. What is R
- R is a system for statistical computation and graphics. It provides, among other things, a programming language, high-level graphics, interfaces to other languages and debugging facilities.
- Developed after the S language and environment
  - S was developed at Bell Labs (John Chambers et al.)
  - S-Plus: a value-added implementation of the S language (Insightful Corporation)
  - much code written for S runs unaltered under R
- Significantly influenced by Scheme, a Lisp dialect
4. What is R
- Ihaka and Gentleman, University of Auckland (New Zealand)
  - 1993: a preliminary version of R
  - 1995: released under the GNU General Public License
  - Now: R-core team consisting of 17 members, including John Chambers
- R provides a wide variety of statistical (linear and non-linear modelling, classical statistical tests, time-series analysis, classification, clustering, robust methods and many more) and graphical techniques
- R is available as Free Software under the terms of the GNU General Public License (GPL).
5. R Extensibility (R Packages)
- One of the most important features of R is its extensibility by creating packages of functions and data.
- The R package system provides a framework for developing, documenting, and testing extension code.
- Packages can include R code, documentation, data and foreign code written in C or Fortran.
- Packages are distributed through the CRAN repository (http://cran.r-project.org): currently more than 1300 packages covering a wide variety of statistical methods and algorithms.
- Base and recommended packages are included in all binary distributions.
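A minimal sketch of how packages are installed and loaded, assuming a standard R installation that includes the recommended packages. The choice of robustbase and MASS is only illustrative; the install step is commented out because it needs network access.

```r
# Install a contributed package from CRAN (needs network access; run once).
# install.packages("robustbase")

# Base and recommended packages ship with R and only need to be loaded.
library(stats)   # base package, attached by default
library(MASS)    # recommended package

# List the locally installed packages marked as base or recommended.
pkgs <- installed.packages()
core <- pkgs[!is.na(pkgs[, "Priority"]), c("Package", "Priority")]
head(core)
```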
6. R and the Others (R Interfaces)
- Reading and writing data (text files, XML, spreadsheet-like data, e.g. Excel)
- Read and write data formats of SAS, S-Plus, SPSS, STATA, Systat, Octave: package foreign
- Emulation of Matlab: package matlab
- Communication with RDBMS: ROracle, RMySQL, RSQLite, RmSQL, RPgSQL, RODBC (large data sets, concurrency)
- Package filehash: a simple key-value style database; the data are stored on disk but are handled like data sets
- Can use compiled native code in C, C++, Fortran, Java
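As a rough illustration of the data interfaces listed above, the sketch below writes a small data frame to a text file and to a Stata file and reads both back. The data frame `df` and the temporary files are made up for the example; `write.dta()` and `read.dta()` come from package foreign.

```r
library(foreign)   # recommended package, shipped with R

# Hypothetical example data; any data frame would do.
df <- data.frame(id = 1:3, income = c(1200.5, 980.0, 1510.25))

# Text file round trip with base R.
csv_file <- tempfile(fileext = ".csv")
write.csv(df, csv_file, row.names = FALSE)
from_csv <- read.csv(csv_file)

# Stata format round trip with package foreign.
dta_file <- tempfile(fileext = ".dta")
write.dta(df, dta_file)
from_dta <- read.dta(dta_file)

str(from_csv)
str(from_dta)
```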
7. R Graphics
- One of the most important strengths of R: simple exploratory graphics as well as well-designed publication quality plots.
- The graphics can include mathematical symbols and formulae where needed.
- Can produce graphics in many formats
  - On screen
  - PS and PDF for including in LaTeX and pdfLaTeX or for distribution
  - PNG or JPEG for the Web
  - On Windows, metafiles for Word, PowerPoint, etc.
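The sketch below draws one plot with a mathematical formula in its title and sends it to two file devices. The helper `draw()` and the temporary file names are assumptions for the example; `png()` and `jpeg()` are opened and closed in the same way as `pdf()`.

```r
# Hypothetical helper: draws the plot on whichever device is currently open.
# The title is a plotmath expression, rendered as a formula.
draw <- function() {
  curve(dnorm(x), from = -4, to = 4, ylab = "density",
        main = expression(f(x) == frac(1, sqrt(2 * pi)) * e^{-x^2 / 2}))
}

pdf_file <- tempfile(fileext = ".pdf")
pdf(pdf_file, width = 6, height = 4)   # PDF, e.g. for pdfLaTeX
draw()
dev.off()

ps_file <- tempfile(fileext = ".ps")
postscript(ps_file)                    # PostScript, e.g. for LaTeX
draw()
dev.off()
```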
8. R Graphics: basic and multipanel plots (trellis)
9. R Graphics: parallel plot and coplot
10. R for Time Series
- Package stats
  - classical time series modeling tools: arima() for Box-Jenkins type analysis
  - structural time series: StructTS()
  - filtering and decomposition: decompose() and HoltWinters()
- Package forecast: additional forecast methods and graphical tools
- Analyzing monthly or lower frequency time series
  - TRAMO/SEATS
  - X-12-ARIMA
  - accessible through the Gretl library
- Task View Econometrics: http://cran.r-project.org/web/views/Econometrics.html
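A short sketch of the stats tools named above. The AirPassengers series, which ships with R, and the log transformation are choices made for this example only.

```r
# AirPassengers: monthly airline passenger totals, shipped with R.
y <- log(AirPassengers)

dec <- decompose(y)                  # classical trend/seasonal/remainder split
hw  <- HoltWinters(y)                # exponential smoothing, trend + seasonality
sts <- StructTS(y, type = "BSM")     # basic structural model

plot(dec)                            # one panel per component
fc <- predict(hw, n.ahead = 12)      # forecasts one year ahead (log scale)
fc
```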
11. R for Time Series: Example
- Fitting an ARIMA model to a univariate time series with arima() and using tsdiag() to plot time series diagnostics
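The slide's own data are not reproduced here, so this sketch uses the `lh` series from the datasets package and an AR(1) specification, both chosen only for illustration.

```r
# lh: luteinizing hormone series from the datasets package.
fit <- arima(lh, order = c(1, 0, 0))   # AR(1) with a mean term
print(fit)

# Standardized residuals, residual ACF and Ljung-Box p-values.
tsdiag(fit)

# Forecast the next 12 observations.
fc <- predict(fit, n.ahead = 12)
fc$pred
```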
12. R for Survey Analysis
- Complex survey samples are usually analysed by specialized software packages: SUDAAN, Bascula 4 (Statistics Netherlands), etc.
- STATA provides much more comprehensive support for analysing survey data than SAS and SPSS and could successfully compete with the specialized packages
13. R for Survey Analysis
- R package survey: http://faculty.washington.edu/tlumley/survey/
- stratification, clustering, possibly multistage sampling, unequal sampling probabilities or weights; multistage stratified random sampling with or without replacement
- Summary statistics: means, totals, ratios, quantiles, contingency tables, regression models, for the whole sample and for domains
- Variances by Taylor linearization or by replicate weights (BRR, jack-knife, bootstrap, or user-supplied)
- Graphics: histograms, hexbin scatterplots, smoothers
- Other packages: pps, sampling, sampfling
14. R and the Outliers (Robust Statistics in R)
- What are Outliers
  - atypical observations which are inconsistent with the rest of the data or deviate from the postulated model
  - may arise through contamination, errors in data gathering, or misspecification of the model
  - classical statistical methods are very sensitive to such data
- What are Robust methods
  - Produce reasonable results even when one or more outliers may appear in the data
- Robust regression: robustbase
- Robust multivariate methods: rrcov, robustbase
- Robust time series analysis: robust-ts
15. R and the Outliers: Example
- Example Wages and Hours: http://lib.stat.cmu.edu/DASL/
- a national sample of 6000 households with a male head earning less than $15,000 annually in 1966
- 9 independent variables classified into 39 demographic groups
- estimate y, the labour supply (average hours), from the available data (for the example we will consider only one variable: x, the average age of the respondents)
- We will fit an Ordinary Least Squares (OLS) and a robust Least Trimmed Squares (LTS) model
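The Wages and Hours data are not bundled here, so the sketch below uses simulated stand-in data with a few gross outliers to contrast the two fits. It assumes the contributed package robustbase is installed; `ltsReg()` is its least trimmed squares function.

```r
library(robustbase)   # contributed package: install.packages("robustbase")

# Simulated stand-in data (NOT the Wages and Hours data): a linear trend
# with five gross outliers at the high end of x.
set.seed(1)
x <- seq(1, 10, length.out = 40)
y <- 2 + 0.5 * x + rnorm(40, sd = 0.3)
y[36:40] <- y[36:40] - 8
d <- data.frame(x = x, y = y)

ols <- lm(y ~ x, data = d)        # ordinary least squares
lts <- ltsReg(y ~ x, data = d)    # least trimmed squares

rbind(OLS = coef(ols), LTS = coef(lts))

plot(d$x, d$y, xlab = "x", ylab = "y")
abline(coef = coef(ols), lty = 2)   # dashed: pulled towards the outliers
abline(coef = coef(lts), lty = 1)   # solid: follows the bulk of the data
```

In this simulated setting the LTS slope should stay close to the true value of 0.5, while the OLS slope is dragged down by the contaminated points.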
16. R and the Outliers: Example OLS
17. R and the Outliers: Example LTS
18. R and the Outliers: Example Covariance
- Maronna and Yohai (1998)
- rrcov: data set maryo
- A bivariate data set with
  - sample correlation 0.81
  - interchange the largest and smallest value in the first coordinate
  - the sample correlation becomes 0.05
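The steps on this slide can be sketched as follows, assuming the contributed package rrcov is installed. The final line adds a robust estimate based on the MCD, using `CovMcd()` and `getCorr()` from rrcov, as one possible alternative to the classical correlation.

```r
library(rrcov)   # contributed package: install.packages("rrcov")
data(maryo)

r_before <- cor(maryo[, 1], maryo[, 2])   # classical sample correlation

# Interchange the largest and smallest value of the first coordinate.
imin <- which.min(maryo[, 1])
imax <- which.max(maryo[, 1])
maryo1 <- maryo
maryo1[c(imin, imax), 1] <- maryo[c(imax, imin), 1]

r_after <- cor(maryo1[, 1], maryo1[, 2])  # classical estimate after the swap
c(before = r_before, after = r_after)

# A robust alternative: correlation matrix from the MCD estimate.
getCorr(CovMcd(maryo1))
```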
19. More R
- R and the WEB: several projects that provide possibilities to use R over the WEB
- R and the Missing: advanced missing value handling
  - mvnmle: ML estimation for multivariate data with missing values
  - mitools: Tools for multiple imputation of missing data
  - mice: Multivariate Imputation by Chained Equations
  - EMV: Estimation of Missing Values for a Data Matrix
  - VIM: provides methods for the visualisation as well as imputation of missing data
- R Objects: R is an Object Oriented language (however in a quite different sense from C++, Java, C#)
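To illustrate the different sense of object orientation, here is a small sketch of R's S3 system, where methods belong to generic functions rather than to classes. The class `survey_result`, its constructor and the generic `ci()` are hypothetical names invented for this example.

```r
# S3: a class is just an attribute; methods are found by generic dispatch.
survey_result <- function(estimate, se) {   # hypothetical constructor
  structure(list(estimate = estimate, se = se), class = "survey_result")
}

# Method for the existing generic print(); invoked as print(obj).
print.survey_result <- function(x, ...) {
  cat(sprintf("estimate %.2f (SE %.2f)\n", x$estimate, x$se))
  invisible(x)
}

# A new generic function with a method for the class.
ci <- function(x, ...) UseMethod("ci")
ci.survey_result <- function(x, level = 0.95, ...) {
  z <- qnorm(1 - (1 - level) / 2)
  c(lower = x$estimate - z * x$se, upper = x$estimate + z * x$se)
}

r <- survey_result(644.2, 23.5)
print(r)
ci(r)
```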
20. More R
- R GUI
  - R Commander: a basic statistics GUI, consisting of a window containing several menus, buttons, and information fields
  - SciViews: a suite of companion applications for Windows
- R and SDMX
- R Reports
  - package xtable: coerce data to LaTeX and HTML tables
  - package Sweave: a framework for mixing text and R code for automatic report generation
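A brief sketch of the xtable workflow, assuming the contributed package xtable is installed. The regression on the built-in `cars` data and the caption text are placeholders.

```r
library(xtable)   # contributed package: install.packages("xtable")

fit <- lm(dist ~ speed, data = cars)
tab <- xtable(summary(fit), caption = "OLS fit of dist on speed")

print(tab)                  # LaTeX table, ready to paste or use in Sweave
print(tab, type = "html")   # the same table as HTML
```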
21. Summary
- Output Management System
  - SAS/SPSS: it is rarely used for routine work
  - R: output is easily passed from one function to another to do further processing and to obtain more results
- Macro Language
  - SAS/SPSS: a special language with own syntax. The new functions are not run in the same way as the built-in procedures
  - R: R itself is a programming language
- Matrix Language
  - SAS/SPSS: a special language with own syntax
  - R: a vector and matrix based language, complemented by additional packages Matrix, SparseM
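The three points above can be sketched in a few lines of base R. The built-in `cars` data and the helper `cv()` are chosen only for illustration.

```r
# Output is an ordinary object that other functions can take apart.
fit   <- lm(dist ~ speed, data = cars)
coefs <- coef(summary(fit))           # coefficient table as a numeric matrix
coefs["speed", "Estimate"]

# User-defined functions are called exactly like built-in ones.
cv <- function(x) sd(x) / mean(x)     # hypothetical helper: coeff. of variation
sapply(cars, cv)

# Vector and matrix arithmetic: OLS via the normal equations.
X    <- cbind(1, cars$speed)
beta <- solve(t(X) %*% X, t(X) %*% cars$dist)
beta
```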
22. Summary (cont.)
- Publishing results
  - SAS/SPSS: Cut and paste to a word processor or exporting to a file
  - R: produce LaTeX output (including graphics) using for example the Sweave package
- Data size
  - SAS/SPSS: Limited by the size of the disk
  - R: Limited by the size of the RAM; (not trivial) usage of databases for large data sets is possible
- Data structure
  - SAS/SPSS: Rectangular data set
  - R: Rectangular data frame, vector, list
23. Summary (cont.)
- Interface to other programming languages
  - SAS/SPSS: Not available
  - R: R can be easily mixed with Fortran, C, C++ and Java
- Source code
  - SAS/SPSS: Not available
  - R: the source code of R itself as well as of its packages is a part of the distribution
24. References
- Hornik, K. and Leisch, F. (2005) R Version 2.1.0, Computational Statistics, 20(2), pp. 197-202
- Kabacoff, R. (2008) Quick-R for SAS and SPSS users, available from http://www.statmethods.net/index.html
- López-de-Lacalle, J. (2006) The R-computing language: Potential for Asian economists, Journal of Asian Economics, 17(6), pp. 1066-1081
- Muenchen, R. (2007) R for SAS and SPSS users, URL: http://oit.utk.edu/scc/RforSASSPSSusers.pdf
- Murrell, P. (2005) R Graphics, Chapman & Hall
- R Development Core Team (2007) R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, ISBN 3-900051-07-0. URL: http://www.r-project.org/
- Templ, M. and Filzmoser, P. (2008) Visualisation of Missing Values and Robust Imputation in Environmental Surveys, submitted for publication
- Wheeler, D.A. (2007) Why Open Source Software / Free Software (OSS/FS, FLOSS, or FOSS)? Look at the Numbers!