Title: Spatial Statistics and Spatial Knowledge Discovery First law of geography [Tobler]: Everything is related to everything, but nearby things are more related than distant things. Drowning in Data yet Starving for Knowledge [Naisbitt -Rogers]
- Lecture 1: Introduction to R
- Pat Browne
2 Introduction to programming in R
- R is a computer language and environment that allows users to
program algorithms and use pre-written packages. R is a free software
environment for statistical computing and graphics (including mapping).
- There are special R packages for handling and analyzing spatial data.
For example, the sp package provides classes and methods for points,
lines, polygons, and grids.
- R can extract spatial data from PostgreSQL. Also, R can be combined
with SQL using PL/R.
3 Installing R
- R for Windows can be downloaded from
- http://ftp.heanet.ie/mirrors/cran.r-project.org/bin/windows/base/R-2.14.1-win.exe
- See Lab1.doc for installation details.
4 Starting R
- We will look at the main features of R; see lab1.doc for more
details. This lecture also presents an introduction to programming.
- The basic components of current languages are:
- Data types, e.g. Integer, String, Polygon.
- Variables to refer to data, e.g. a <- 2
- Operations on those data types, e.g. area(polygon)
- Control structures, e.g. sequence, iteration, and conditions.
- Logic is an important part of programming, but it is often implicit
and external to the language. Some languages like SQL are quite close
to logic.
6 Starting R: Variables
- Variables provide a means of accessing the data stored in computer
memory. R provides a number of specialized data structures or objects
(also called data types). These objects are referenced in your
programs using variables.
- Store: a <- 2        Access: a
- Store: b <- "Pat"    Access: b
- This assigns the number 2 to the variable a and the string "Pat" to
the variable b.
7 Starting R: Data types
- A data type represents a constraint placed upon the interpretation
of data in a type system, describing representation, interpretation,
legal operations and structure of values.
- Data types are a way to limit the kind of data that can be used by a
particular program or stored in a database table. Types restrict the
data to a certain set of values (e.g. 1, 2, 3, ... for Integers).
- Data types are also restricted to certain operations on the type
(e.g. addition for Integers). R comes with a range of standard data
types that can be used to represent strings, integers, real numbers,
and dates, but R also has types that are especially suited to
statistics, such as vectors and tables.
8 Starting R: Data types
The c() function combines its arguments into a vector. In R the term
mode is used to describe data types. There are 4 basic types or modes:
numeric, character, complex, and logical. These can be combined to
form collections, which are called objects in R.
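- A minimal sketch of the four basic modes, assuming nothing beyond base R. The variable names (num, chr, cpx, lgl, mixed) are made up for illustration. It also shows that c() coerces mixed arguments to a single common mode.

```r
# One vector for each basic mode, built with c().
num <- c(1, 2.5, 10)          # numeric
chr <- c("red", "green")      # character
cpx <- c(1+2i, 3-1i)          # complex
lgl <- c(TRUE, FALSE, TRUE)   # logical

modes <- c(mode(num), mode(chr), mode(cpx), mode(lgl))
print(modes)

# A vector holds one mode only, so mixed input is coerced.
mixed <- c(1, "a", TRUE)
print(mode(mixed))   # "character"
```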
9-11 Starting R: Data types (Objects) [figures not available]
12 Starting R: Finding data types [figure not available]
13 Starting R: Data types
- Numbers: 1, 1.4
- Strings: "ABC" or "abc"
- Vectors
- Arrays: vectors plus a dimension vector (dim)
- Factors: for nominal and ordered categorical data
- Data frames: matrix-like, for data of different types
- Tables
- One-way tables
- Two-way tables
14 Starting R: Data types - Numbers
- > a <- 3
- > b <- sqrt(a*a+3)
- List the defined variables/objects:
- > ls()
- We can add 1 to every element of a vector:
- > a <- c(1,2,3,4,5)
- > a+1
- We can get the mean, variance, and standard deviation of a vector of
numbers:
- > mean(a)
- > var(a)
- > sd(a)
15 Starting R: Data types - Strings
- > a <- "hello"
- > a
- [1] "hello"
- > b <- c("hello","there")
- > b[1]
- > b[2]
16 Starting R: Data types - Vector
- R operates on named data structures. The simplest such structure is
the numeric vector, which is a single entity consisting of an ordered
collection of numbers. To set up a vector named x use the R command
- > x <- c(10.4, 5.6, 3.1, 6.4, 21.7)
- > x[2]
- Assignment is written as <- in R. The above assignment uses the
function c(), which can take an arbitrary number of vector arguments
and whose value is a vector obtained by concatenating its arguments
end to end.
- A number occurring by itself in an expression is taken as a vector
of length one.
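- The concatenation behaviour of c() can be sketched as follows. The vector y is an illustrative name, not part of the lecture material.

```r
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)

# c() accepts vectors as arguments and joins them end to end.
y <- c(x, 0, x)        # 5 + 1 + 5 elements
print(length(y))       # 11
print(y[6])            # the 0 placed between the two copies of x

# A single number is itself a vector of length one.
print(is.vector(3))
print(length(3))
```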
17 Starting R: Data types - Arrays
- Arrays are vectors plus the dim attribute (dimension vector);
matrices are arrays with a dim attribute of length 2. Arrays are
stored in column-major order.
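- A small sketch of what "vector plus dim attribute" means in practice, using only base R. The names v and arr are illustrative.

```r
v <- 1:6
dim(v) <- c(2, 3)      # v becomes a 2 x 3 matrix, filled column by column
print(v)
print(is.matrix(v))    # TRUE: the dim attribute has length 2
print(v[1, 2])         # 3: the second column starts at the 3rd element

arr <- array(1:24, dim = c(2, 3, 4))   # a 3-dimensional array
print(dim(arr))
print(arr[1, 1, 2])    # 7: the first element of the second 2 x 3 slice
```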
19 Starting R: Data types - Tables
- > x = c("Yes","No","No","Yes","Yes")
- > table(x)
- x
-  No Yes
-   2   3
20 Types of Categorical data
- Nominal: mutually exclusive categories, e.g. male/female,
dead/alive, smoker/non-smoker, bus/car/train. Tends to be unordered or
to have no logical hierarchy.
- Ordinal: can be ranked in a meaningful order. The distance between
values is not relevant, as there is no distance information, e.g. race
positions (1st, 2nd, 3rd) or grouped amounts (1-5, 6-10, 11-15 per
day). Unlike nominal values, ordinal values can be compared against
each other.
21 Starting R: Data types - Factor
- When looking at the impact of carbon dioxide (CO2) on the growth
rate of a tree you might try to observe how different trees grow when
exposed to different preset concentrations of CO2. The different
levels are often called categories or factors. CO2 is measured in
parts per million by volume (ppmv). Levels could be L1 = 0-3,
L2 = 3-6, L3 = 6-9, L4 = 9-12 ppmv (ignoring double inclusion of
boundaries).
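- One possible way to turn measured concentrations into such levels is the base R function cut(). This is a sketch, not the lecture's own method, and the readings in co2 are invented for illustration. By default each interval includes its upper boundary only, which avoids the double inclusion of boundaries.

```r
co2 <- c(1.0, 4.5, 7.2, 11.0, 2.5)   # hypothetical readings in ppmv

# Intervals are (0,3], (3,6], (6,9], (9,12] by default.
co2_level <- cut(co2,
                 breaks = c(0, 3, 6, 9, 12),
                 labels = c("L1", "L2", "L3", "L4"))
print(co2_level)
print(table(co2_level))   # how many readings fall in each level
```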
22 Starting R: Data types - Factor
- Categorical data is often used to classify data into various levels
or factors. For example, smoking data could be a factor in a broader
survey on health issues. R has a special class for working with
factors, and R adapts its behaviour when it knows it has a factor.
- > x = c("Yes","No","No","Yes","Yes")
- > factor(x)
- [1] Yes No  No  Yes Yes
- Levels: No Yes
23 Starting R: Data types - Factor
- We will assume that your data files are stored in C:\My-R-Dir\
- Load in the file trees91.csv.
- tree <- read.csv(file="C:\\My-R-Dir\\trees91.csv", header=TRUE, sep=",")
- The summary operation prints out the possible values and the
frequency with which they occur. Find the summary of the chamber
identification label (CHBR):
- summary(tree$CHBR)
24 Starting R: Data types - Factor
- summary(tree$CHBR)
- Note that for numeric data the output of the summary operation
includes quartiles. Quartiles are the three points (the middle one
being the median) that divide a sorted data set into four equal
groups, each containing a quarter of the sampled values.
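- A short sketch of quartiles on a made-up numeric vector (x is illustrative and is not taken from trees91.csv). It uses quantile() and summary() from base R with their default settings.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

# The three quartiles: 25%, 50% (the median) and 75%.
q <- quantile(x, probs = c(0.25, 0.5, 0.75))
print(q)

# summary() reports the same quartiles along with min, mean and max.
print(summary(x))
```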
25 Starting R: Data types - Factor
- A nominal value is represented as a factor in R. The factor stores
the nominal values as a vector of integers in the range 1...k,
- where k is the number of unique values in the nominal variable,
e.g. female=1, male=2,
- and an internal vector of character strings (the original values)
mapped to these integers.
26 Starting R: Data types - Factor
- Consider a variable gender with 20 male entries and 30 female
entries:
- gender <- c(rep("male",20), rep("female", 30))
- gender <- factor(gender)
- This stores gender internally as 20 2s followed by 30 1s, where
1=female and 2=male (levels are ordered alphabetically).
- R now treats gender as a nominal variable.
- summary(gender)
- What does rep() do? How would you find out?
- Type ?rep into R and see.
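- The internal coding of a factor can be inspected directly. The following sketch uses as.integer() and levels() from base R; the name codes is illustrative.

```r
gender <- factor(c(rep("male", 20), rep("female", 30)))

print(levels(gender))         # the character labels, sorted alphabetically
codes <- as.integer(gender)   # the integer codes stored internally
print(table(codes))           # counts of code 1 (female) and code 2 (male)
print(summary(gender))        # counts per level

# rep() repeats its first argument the given number of times.
print(rep("male", 3))
```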
27 Starting R: Data types - Factor
- An ordered factor is used to represent an ordinal variable. Consider
a variable rating coded as large, medium, small:
- rating <- c(rep("large",10), rep("medium",10), rep("small",10))
- rating <- ordered(rating)
- R codes rating as 1, 2, 3 and associates 1=large, 2=medium, 3=small
internally.
- R uses factor for nominal variables and ordered for ordinal
variables in statistical procedures and graphical analyses.
- Try the command plot(rating)
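- By default the levels are ordered alphabetically, which here gives large < medium < small. If the intended ranking is by size, the levels argument of ordered() can state it explicitly. This is a sketch of that option; the names default_order and size_order are illustrative.

```r
rating <- c(rep("large", 10), rep("medium", 10), rep("small", 10))

default_order <- ordered(rating)   # alphabetical level order
size_order <- ordered(rating, levels = c("small", "medium", "large"))

print(levels(default_order))
print(levels(size_order))

# Ordered factors support comparisons between values.
print(size_order[1] > size_order[30])   # "large" compared with "small"
print(as.integer(size_order[1]))        # "large" is now level 3
```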
28 Starting R: Data types - Factor
- A factor is a vector object used to specify a discrete
classification (grouping) of the components of other vectors of the
same length. R provides both ordered and unordered factors. The main
application of factors is with model formulae. Suppose a sample of 30
tax accountants from all the states and territories of Australia is
described by a character vector giving each accountant's state:
- state <- c("tas", "sa", "qld", "nsw", "nsw", "nt", "wa", "wa",
  "qld", "vic", "nsw", "vic", "qld", "qld", "sa", "tas", "sa", "nt",
  "wa", "vic", "qld", "nsw", "nsw", "wa", "sa", "act", "nsw", "vic",
  "vic", "act")
- A factor is created using the factor() function:
- statef <- factor(state)
- summary(statef)
- To find out the levels of a factor the function levels() can be
used.
- levels(statef)
- [1] "act" "nsw" "nt"  "qld" "sa"  "tas" "vic" "wa"
29 Starting R: Data types - Matrix
- A matrix is a collection of data elements arranged in a
two-dimensional rectangular layout. The following is an example of a
matrix with 2 rows and 3 columns.
30 Starting R: Data types - Matrix
- > A = matrix(
    c(2, 4, 3, 1, 5, 7),   # the data elements
    nrow=2,                # number of rows
    ncol=3,                # number of columns
    byrow = TRUE)          # fill matrix by rows
- > A                      # print the matrix
       [,1] [,2] [,3]
  [1,]    2    4    3
  [2,]    1    5    7
- An element at the mth row, nth column of A can be accessed by the
expression A[m, n].
- > A[2, 3]                # element at 2nd row, 3rd column
  [1] 7
- The entire mth row of A can be extracted as A[m, ].
- > A[2, ]                 # the 2nd row
  [1] 1 5 7
- Similarly, the entire nth column of A can be extracted as A[ ,n].
- > A[ ,3]                 # the 3rd column
  [1] 3 7
31 Starting R: Data types - Dataframe
- A dataframe is more general than a matrix, in that different columns
can have different modes (numeric, character, factor, etc.). It is a
bit like an SQL table.
- d <- c(1,2,3,4)
- e <- c("red", "white", "red", NA)
- f <- c(TRUE,TRUE,TRUE,FALSE)
- mydata <- data.frame(d,e,f)
- names(mydata) <- c("ID","Color","Passed")
- There are a variety of ways to identify the elements of a dataframe.
- mydata[2:3]               # columns 2,3 of dataframe
- mydata[c("ID","Color")]   # columns ID, Color
- mydata$ID                 # column ID, by name
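- The three ways of selecting columns can be tried on the small data frame described above. This sketch rebuilds it so that it runs on its own, and adds str() to show the mode of each column.

```r
d <- c(1, 2, 3, 4)
e <- c("red", "white", "red", NA)
f <- c(TRUE, TRUE, TRUE, FALSE)
mydata <- data.frame(d, e, f)
names(mydata) <- c("ID", "Color", "Passed")

print(mydata[2:3])                # columns by position (still a data frame)
print(mydata[c("ID", "Color")])   # columns by name (still a data frame)
print(mydata$ID)                  # a single column as a plain vector
str(mydata)                       # one line per column with its type
```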
32 Starting R: Data types - data.frame
- Here we create a data.frame called d.
- L3 <- LETTERS[1:3]
- (d <- data.frame(cbind(x=1, y=1:10),
  fac=sample(L3, 10, replace=TRUE)))
- To view the first four rows: d[1:4, ]
- To view a column: d$x, d$y, d$fac
- Alternative way to view a column: d[ ,3]
33 Starting R: Data types - Table
- One-way tables are created with the table command. Its argument is
a vector of factors, and it calculates the frequency with which each
factor level occurs.
34 Starting R: Data types - one-way Table
- > a <- factor(c("A","A","B","A","B","B","C","A","C"))
- > results <- table(a)
- > results
- a
- A B C
- 4 3 2
- > attributes(results)
- > attributes(results)$dimnames$a
- > attributes(results)$dim
- > attributes(results)$class
- > summary(results)
35 Starting R: Data types - two-way Table
- Say we want to put the results of two questions into a table.
- First question responses are Never, Sometimes, and Always.
- Second question responses are Yes, No, and Maybe. Two vectors a and
b contain the responses for each person.
- In the vectors, responses are matched by position. The third item
in a is how the third person responded to the first question, and the
third item in b is how the third person responded to the second
question.
- In the following we can see that two people who said "Sometimes" to
the first question also said "Maybe" to the second question.
36 Starting R: Data types - two-way Table
- In table(a,b), a gives the rows and b gives the columns.
- a <- c("Sometimes","Sometimes","Never","Always","Always",
  "Sometimes","Sometimes","Never")
- b <- c("Maybe","Maybe","Yes","Maybe","Maybe","No","Yes","No")
- results <- table(a,b)
- > results
-            b
- a           Maybe No Yes
-   Always        2  0   0
-   Never         0  1   1
-   Sometimes     2  1   1
- The table shows that two people who said Sometimes to the first
question also said Maybe to the second question.
- The elements are accessed like a matrix, e.g. results[,1].
- How many people responded?
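- A runnable sketch of the two-way table, showing matrix-style access and one way to count the respondents (summing all cells). The use of sum() and rowSums() is a suggestion rather than part of the original slides.

```r
a <- c("Sometimes", "Sometimes", "Never", "Always", "Always",
       "Sometimes", "Sometimes", "Never")
b <- c("Maybe", "Maybe", "Yes", "Maybe", "Maybe", "No", "Yes", "No")

results <- table(a, b)   # rows come from a, columns from b
print(results)

print(results["Sometimes", "Maybe"])   # one cell, selected by label
print(results[, 1])                    # first column (the "Maybe" answers)
print(rowSums(results))                # respondents per answer to question 1
print(sum(results))                    # total number of respondents
```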
37 Useful functions
- length(object)               number of elements or components
- str(object)                  structure of an object
- class(object)                class or type of an object
- names(object)                names
- c(object, object, ...)       combine objects into a vector
- cbind(object, object, ...)   combine objects as columns
- rbind(object, object, ...)   combine objects as rows
38 Useful functions
- object                       typing the name prints the object
- ls(), objects()              list current objects
- rm(object)                   delete an object
- newobject <- edit(object)    edit a copy and save it as newobject
- fix(object)                  edit in place
- data.entry(result)           GUI edit in place
- mode(object)                 type (mode) of the object
39 Starting R: Input-Output (IO)
- There are many ways to get data into R. We focus on just three:
- Assignment
- Reading a CSV file (writing later)
- Loading data from PostgreSQL (later)
40 Starting R: IO - Assignment
- Assignment (LHS <- RHS) allows the value of an expression on the RHS
to be stored in a named object on the LHS. In R:
- > a <- c(3,5,7,9)
- The above assignment uses the combine command (c means combine).
This makes a vector called a. No output is produced yet. Now we can
retrieve the contents of a just by typing it in.
- > a
- > a[3]
- The first command gives all of a; the second command gives the third
element of a. 3 is called the index. Indices start at 1; the zero
entry a[0] returns an empty vector whose type shows the data type of
a. Try
- b <- c("one","two","three")
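- A short sketch of indexing, including what the zero index returns for a numeric and for a character vector.

```r
a <- c(3, 5, 7, 9)
print(a)        # the whole vector
print(a[3])     # the third element
print(a[0])     # numeric(0): empty, but it still reports the type

b <- c("one", "two", "three")
print(b[2])
print(b[0])     # character(0)
```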
41 Starting R: IO - Assignment
- cells <- c(1,26,24,68)
- rnames <- c("R1", "R2")
- cnames <- c("C1", "C2")
- mymatrix <- matrix(cells, nrow=2, ncol=2, byrow=TRUE,
  dimnames=list(rnames, cnames))
- Type > attributes(mymatrix)
- Type > help(array) to find more details on arrays.
42 Starting R: Input File
- Place the file simple.csv in a directory (folder).
- Load the file into R using
- h <- read.csv(file="C:\\My-R-Dir\\simple.csv", header=TRUE, sep=",")
- View the contents of h.
- Now the contents of the file are stored in R as the object named h.
- Type > names(h)
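- To try read.csv() without the course files, the sketch below writes a small CSV to a temporary file first. The file contents and column names (trial, mass, velocity) are invented for illustration and are not the contents of simple.csv.

```r
# Create a small stand-in CSV file in the session's temporary directory.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("trial,mass,velocity",
             "A,10,12",
             "A,11,14",
             "B,5,8"), csv_path)

# Read it back: the first line supplies the column names.
h <- read.csv(file = csv_path, header = TRUE, sep = ",")
print(h)
print(names(h))
print(h$mass)

unlink(csv_path)   # remove the temporary file
```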
43 Starting R: Data types - Matrices
- All columns in a matrix must have the same data type (numeric,
character, etc.) and the same length. The general format is
- mymatrix <- matrix(vector, nrow=r, ncol=c, byrow=FALSE,
  dimnames=list(char_vector_rownames, char_vector_colnames))
- byrow=TRUE indicates that the matrix should be filled by rows.
byrow=FALSE indicates that the matrix should be filled by columns (the
default). dimnames provides optional labels for the rows and columns.
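- The effect of byrow is easiest to see by filling the same data both ways. The names by_col and by_row are illustrative.

```r
# Default: fill column by column.
by_col <- matrix(1:6, nrow = 2, ncol = 3)

# byrow = TRUE: fill row by row; dimnames lists row labels, then column labels.
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE,
                 dimnames = list(c("R1", "R2"), c("C1", "C2", "C3")))

print(by_col)
print(by_row)
print(by_col[1, 2])         # 3
print(by_row["R1", "C2"])   # 2
```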
44 Review - vectors, lists, matrices, data frames
- To make vectors x, y, year, names:
- x <- c(2,3,7,9)
- y <- c(9,7,3,2)
- year <- 1990:1993
- names <- c("payal", "shraddha", "kritika", "itida")
- Accessing the last element:
- y[length(y)]
- To make a list person:
- person <- list(name="payal", x=2, y=9, year=1990)
- Accessing elements: person$name, person$x, person[[1]]
- names(person)
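- Lists can be indexed with $, with [[ ]] or with [ ]. The sketch below shows the difference: [[ ]] and $ return the element itself, while [ ] returns a shorter list.

```r
person <- list(name = "payal", x = 2, y = 9, year = 1990)

print(person$name)          # element by name
print(person[["x"]])        # same idea with double brackets
print(person[[1]])          # element by position
print(class(person[[1]]))   # "character": the element itself
print(class(person[1]))     # "list": a list containing one element
print(names(person))
```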
45 Review - vectors, lists, matrices, data frames
- To make a matrix, paste together the columns year, x, y using column
bind:
- m <- cbind(year, x, y)
- To make a data frame, which is a list of vectors of the same length:
- D <- data.frame(names, year, x, y)
- nrow(D)
- Accessing one of these vectors:
- D$names
- Accessing the last element of this vector:
- D$names[nrow(D)]
- D$names[length(D$names)]
46 Finding the type and class
- > g <- c(1,3,2)
- > class(g)
- [1] "numeric"
- > typeof(g)
- [1] "double"
- > is(g)
- [1] "numeric" "vector"
47 Sorting
- If the variable i is a vector of integers, then D[i, ] picks up rows
from the data frame D based on the values found in i. The order()
function returns an integer vector that gives the correct ordering for
the purpose of sorting.
- D <- data.frame(x=c(1,2,3,1), y=c(7,19,2,2))
- Sort on x:
- indexes <- order(D$x)
- D[indexes, ]
- Print out the dataset sorted in reverse by y:
- D[rev(order(D$y)), ]
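- A runnable sketch of sorting with order(). The second and third calls (a second sort key, and decreasing = TRUE) are additions for illustration that go slightly beyond the slide.

```r
D <- data.frame(x = c(1, 2, 3, 1), y = c(7, 19, 2, 2))

idx <- order(D$x)          # row numbers in ascending order of x
print(idx)
print(D[idx, ])

by_x_then_y <- D[order(D$x, D$y), ]          # ties in x are broken by y
print(by_x_then_y)

by_y_desc <- D[order(D$y, decreasing = TRUE), ]   # largest y first
print(by_y_desc)
```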
48 Logical constants and variables
- TRUE and FALSE are logical constants.
- T and F are logical variables.
- T and F are not quite synonyms for TRUE and FALSE, but variables
that have the expected values by default.
- TRUE == TRUE
- T == T
- Normally these give the expected result.
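- The difference between the constants and the variables can be sketched as follows: T can be reassigned like any other variable, while TRUE cannot. The eval(parse(...)) wrapper is only there so that the failing assignment can be caught and the script keeps running.

```r
print(TRUE == T)     # TRUE by default

T <- FALSE           # allowed: T is an ordinary variable
changed <- T
print(changed)       # FALSE: code relying on T would now misbehave

rm(T)                # remove our copy; the default definition is visible again
print(T)

# Assigning to the reserved word TRUE raises an error.
result <- tryCatch({
  eval(parse(text = "TRUE <- FALSE"))
  "assigned"
}, error = function(e) "error")
print(result)
```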
49 Missing Values: NA
- Not Available or Missing Values are represented as NA, which is a
logical constant of length 1 that contains a missing value indicator.
- Examples:
- is.na(c(1, NA))          # FALSE TRUE
- is.na(c(NA, NA))         # TRUE TRUE
- is.na(paste(c(1, NA)))   # FALSE FALSE (paste() turns NA into the string "NA")
- xx <- c(0:4)
- is.na(xx) <- c(2, 4)
- xx                       # 0 NA 2 NA 4
50 Writing your own functions
- R comes with a built-in median function.
- Usage: median(x, na.rm = FALSE)
- where x is an object for which a method has been defined, or a
numeric vector containing the values whose median is to be computed;
- na.rm is a logical value indicating whether NA values should be
removed before the computation proceeds.
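- The effect of na.rm can be seen on a small vector with one missing value. The vector x is made up for illustration.

```r
x <- c(4, NA, 1, 3)

with_na <- median(x)                   # NA: the missing value propagates
without_na <- median(x, na.rm = TRUE)  # median of 1, 3, 4

print(with_na)
print(without_na)
print(sum(is.na(x)))                   # number of missing values
```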
51 Control - If
- > if (T) print("Hello") else print("Good Bye")
- [1] "Hello"
- > if (F) print("Hello") else print("Good Bye")
- [1] "Good Bye"
52 Control - Sequence
- a <- c(1,2,3,4,5)
- b <- c(2,3,4,5)
- odd.even <- length(a) %% 2
- if (odd.even == 0)
  (sort(a)[length(a)/2] + sort(a)[1 + length(a)/2])/2 else
  sort(a)[ceiling(length(a)/2)]
- If we want to find the median of b we have to type the whole thing
again, including the test on the length:
- > odd.even <- length(b) %% 2
- > if (odd.even == 0)
  (sort(b)[length(b)/2] + sort(b)[1 + length(b)/2])/2 else
  sort(b)[ceiling(length(b)/2)]
- It would be better to write a function.
53 User Written - Functions
- a <- c(1,2,3,4,5)
- b <- c(2,3,4,5)
- mymedian <- function(x) {
    odd.even <- length(x) %% 2
    if (odd.even == 0)
      (sort(x)[length(x)/2] + sort(x)[1 + length(x)/2])/2 else
      sort(x)[ceiling(length(x)/2)]
  }
- Now we can call, run, execute or invoke mymedian on any vector.
- > mymedian(a)
- > mymedian(b)
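- One way to gain confidence in a user-written function is to compare it with the built-in median on several inputs. The sketch below repeats the definition so that it runs on its own; the test vectors in samples are invented for illustration.

```r
mymedian <- function(x) {
  odd.even <- length(x) %% 2
  if (odd.even == 0) {
    # even length: average the two middle values
    (sort(x)[length(x)/2] + sort(x)[1 + length(x)/2]) / 2
  } else {
    # odd length: take the single middle value
    sort(x)[ceiling(length(x)/2)]
  }
}

samples <- list(c(1, 2, 3, 4, 5), c(2, 3, 4, 5), c(9, 1, 7), c(10, 2))
ours <- sapply(samples, mymedian)
builtin <- sapply(samples, median)

print(ours)
print(isTRUE(all.equal(ours, builtin)))   # do the two versions agree?
```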