Oswald

About This Presentation

Title:

Oswald

Description:

1530-1600 afternoon tea. 1600-1730 hands-on session. Oswald/Windows. 3. Outline - Day 2 ... 1530-1600 afternoon tea. 1600-1730 free laboratory session. Oswald ... – PowerPoint PPT presentation

Number of Views:128

Avg rating:3.0/5.0

Slides: 206

Provided by: davidm126

Category:

more less

Transcript and Presenter's Notes

Title: Oswald

1
Oswald

for Windows
Lancaster University 7-9 June 1998

MathSoft International Knightway House Park
Street Bagshot, Surrey GU19 5AQ United Kingdom
David Smith S-PLUS Product Manager Phone 44
(0)1276 452299 ext. 212 Fax 44 (0)1276
451224 E-mail dsmith_at_mathsoft.co.uk
2
Outline - Day 1

0915-1045 Intro to S-plus and Oswald
1045-1115 coffee
1115-1245 hands-on session
1245-1400 lunch
1400-1530 Longitudinal data
1530-1600 afternoon tea
1600-1730 hands-on session

3
Outline - Day 2

0915-1045 Exploring longitudinal data
1045-1115 coffee
1115-1245 hands-on session
1245-1400 lunch
1400-1530 Basic modelling
1530-1600 afternoon tea
1600-1730 hands-on session

4
Outline - Day 3

0915-1045 Advanced modelling
1045-1115 coffee
1115-1245 hands-on session
1245-1400 lunch
1400-1530 Other topics
1530-1600 afternoon tea
1600-1730 free laboratory session

5
Day 1

Introduction to S-plus and Oswald

6
Day 1 Session 1

What is S-PLUS?
Oswald overview
Starting S-PLUS for Windows
The S command language
Fundamentals -- expressions and assignments
Vectors, matrices and lists
Calling functions

7
S-PLUS History

1980s S Language developed by ATT Bell Labs
Added value S-plus distributed by StatSci Inc
1992 Mathsoft acquires Statsci, Axum
1997 S-plus 4.0 released
Axum Graphics, S Language, Office GUI
1998 S-plus 4.5 (Windows) S-plus 5.0 (Unix)
released
Best-regarded statistical research tool in the
industry
50,000 users world-wide

8
S-PLUS 4.5For exploratory data analysis and
statistical modelling

From basic statistics to cutting edge statistics
Customisable and extensible GUI
For data analysts and researchers of all levels
Powerful, object-oriented S language
With thousands of built-in functions
Total control of data and graphics
Worldwide S-PLUS users group

9
Book Recommendations

Modern Applied Statistics with S-plus
W.N. Venables and B.D. Ripley
Springer 1997 (second edition)
The New S Language A programming environment for
data analysis and graphics
Becker, Chambers Wilks
Wadsworth Brooks/Cole 1988
Analysis of Longitudinal Data
Diggle, Liang Zeger
Oxford University Press 1994

10
Oswald - Introduction

Object-oriented Software for the Analysis of
Longitudinal Data

11
Longitudinal Data

A (largish) number of subjects
people, households, rats, cows,
A measurement taken repeatedly on each subjects
over a (smallish) number of times
Daily blood-pressure on 65 patients taken over 2
weeks. 30 patients on control drug, 35 on new
anti-hypertensive drug.

12
Oswald features

New data type for longitudinal data
Subsample selection
Creating replicated covariates
Parallel plots
Shadow plots
Available for Windows and Unix

Analysis of longitudinal data
Kernel smoothing
Rice-Silverman C-V
Variogram
Mixed modelling
Dropout modelling
Comes with GEE and ALR

13
Using Oswald

From the S-plus prompt
library(oswald)
The Oswald function and datasets are now
available.

14
Starting S-PLUS
15
(No Transcript)
16
Object Browser

S-plus is object-oriented
data, functions, analysis results, plots, ...
persistence (press DEL to delete)
The Object Browser organises all the objects you
create
Left pane -- objects by classification
Right pane -- list/components of objects
Completely customisable

17
Getting Data

Data Select Data
Select existing data, or import data from a file
Dialog appears at startup (by default)
Import data (later)
Existing data enter the name of an S-plus object
in the Name box and click OK.

18
Data Windows

Data appears in a data window
Like a spreadsheet, but column-oriented
Includes column names and row labels
Click on a column name to select a column
Shift-click to select a block of columns
Ctrl-click for non-adjacent columns
Order of selection is important!
Selected columns used for plots, analysis, ...

19
Simple graphs

Open the 2D or 3D plot palette
Select column(s) for plotting from data window
First ctrl-click -- x axis
Second ctrl-click -- y axis
(Third ctrl-click -- z axis)
Click on the palette button corresponding to the
desired plot

20
Trellis Graphs

You can condition any plot by the value of any
other variable in the data
Create a plot
Select the conditioning column in the data
window, and drag to the title area
Be sure to start dragging from the data
Cursor changes to when ready to drop
Plot is redrawn with panels for subranges of the
conditioning variable

21
Example Ozone data

Select Data
Histogram of Ozone
Scatterplot Temp vs Ozone
With loess line
Trellis conditioning on Wind
3-D plot Temp vs Wind vs Ozone
Vary colour by Radiation
Export graphics

22
The Command Window

Behind all of the menus and toolbars lies the S
command language
A complete programming language
expressions, data types, loops, functions,
object oriented
We will mostly use Oswald by entering commands in
the S language using the Command Window
Click on to open.

23
S Language absolute basics

Enter an expression at the prompt gt
S-plus prints out the result (usually)
gt 11
1 2
gt pi
1 3.141593
gt sin((1sqrt(5))/2pi)
1 -0.9320324

24
Secondary prompt

If you fail to complete the expression, continue
at the prompt
gt exp(1/sqrt(2) -
0.5)
1 1.230114
If you get stuck, enter lots of )s
gt log(1sqrt(5 sin(2)
)))))
Syntax error No opening parenthesis, before ")"
at this point
log(1sqrt(5 sin(2)

25
Vectors

Many functions return vectors
gt rnorm(6)
1 -1.4217905 0.7963978 -0.2487190 -0.8279492
5 0.7495827 1.0976972
The n on the left shows where the row starts.
A single number is a length 1 vector
gt mean(rpois(50, mean5))
1 4.74

26
Making vectors

Sequences ab
gt 110
1 1 2 3 4 5 6 7 8 9 10
gt 5045
1 50 49 48 47 46 45
Concatenation c(v1,v2,v3,)
gt c(6,2,1)
1 6 2 1
gt c(15,51)
1 1 2 3 4 5 5 4 3 2 1

27
Assignment

Store results in objects with lt- (gets)
gt a lt- 110
The result is not printed until you enter the
object name by itself
gt a
1 1 2 3 4 5 6 7 8 9 10

28
Object Names

Valid object names may contain only
Letters abcXYZ
Numbers 0123456789
Dot .
Valid names
y Y weight x.var LD50 .my.dat sum
Invalid names
m-1 heart_rate 9th T F NA

29
Handling objects

List all your objects
gt objects()
1 ".Last.value" ".Random.seed"
3 ".ldats.options" "a
Objects are persistent and remain until removed
(even if you quit S-plus)
gt remove("a")
gt a
Error Object "a" not found
Or, use the Object Browser

30
Objects as variables

Objects can be used in expressions
gt a lt- 15
gt sum(a)
1 15
gt b lt- c(a, 10)
gt length(b)
1 6
gt 2b
1 2 4 6 8 10 20

31
Vector arithmetic

Scalar functions work elementwise
gt a lt- 14
gt sqrt(a)
1 1.000000 1.414214 1.732051 2.000000
Scalar and vector arithmetic
gt 2a
1 2 4 6 8
gt 2a log(a)
1 2.000000 4.693147 7.098612 9.386294

32
Logical vectors

Expressions with relational operators return
logical vectors
ab agt5 blt0 a!0
T is True, F is False
gt a lt- rnorm(5)
gt a
1 -0.1495632 0.1389647 2.3571278 2.1495335
5 -1.8157597
gt a lt 0
1 T F F F T

33
Missing values -- NA

The missing value is represented by NA
gt a lt- c(1, NA, 2)
Most operations on NA return NA
gt a 1
1 2 NA 3
gt sum(a)
1 NA
is.na checks for missing values
gt is.na(a)
1 F T F

Not aNA
34
Vector indexing

Use to select elements of a vector
gt a lt- c(2,3,5,7,11,13,17)
gt a1
1 2
gt a35
1 5 7 11
gt ac(1,3,5)
1 2 5 11
Negative indices remove elements
gt a-(13)
1 7 11 13 17

35
Logical indices

A logical index selects elements
index as long as the indexed vector
gt b lt- rnorm(8)
gt bblt0
1 -0.04100 -0.86600 -0.07229
gt log.b lt- log(b)
Warning messages
NAs generated in log(x)
gt bis.na(log.b)
1 -0.04100 -0.86600 -0.07229
gt log.b!is.na(log.b)
1 0.7242 0.9154 0.6976 -0.1279 0.2408

36
Replacement

You can also use on the left hand side of an
assignment
gt a lt- sample(110)
gt a
1 2 1 10 9 7 8 3 6 4 5
gt a5 lt- NA
gt a
1 2 1 10 9 NA 8 3 6 4 5
gt ais.na(a) lt- 0
gt a
1 2 1 10 9 0 8 3 6 4 5

37
Calling functions

Functions are called like this
function.name(argument, argument, ...)
Functions always return a value
NULL represents no value
Function arguments have a
position (first, second, )
name
default value (sometimes)

38
Example rep

rep(x, times inferred, length.out
inferred)
gt rep(13,2)
1 1 2 3 1 2 3
gt rep(13,length8)
1 1 2 3 1 2 3 1 2
gt rep(13,c(3,2,1))
1 1 1 1 2 2 3
gt rep()
Error in rep Argument "x" is missing, with no
default rep()
Dumped

39
Example seq

seq(from1, toend, by1,
lengthinferred, alongNULL)
gt seq(1,5)
1 1 2 3 4 5
gt seq(10, 20, length6)
1 10 12 14 16 18 20
gt seq(to100, by15, length7)
1 10 25 40 55 70 85 100
gt seq(length10)
1 1 2 3 4 5 6 7 8 9 10

40
Matrices

matrix(dataNA, nrowinferred, ncolinferred,
byrowF, dimnamesNULL)
gt X lt- matrix(110, nrow2)
gt X
,1 ,2 ,3 ,4 ,5
1, 1 3 5 7 9
2, 2 4 6 8 10
gt dim(X)
1 2 5

41
Matrix indices

gt X2 lt- matrix(115, nrow3, byrowT)
gt X22,3
1 8
gt X212,
,1 ,2 ,3 ,4 ,5
1, 1 2 3 4 5
2, 6 7 8 9 10
gt X223, 35
,1 ,2 ,3
1, 8 9 10
2, 13 14 15
gt X2,1
1 1 6 11

gt X2,1,dropF ,1 1, 1 2,
6 3, 11
42
Transpose and multiply

Transpose t(X)
gt X lt- cbind(rep(1,5),15)
gt t(X)
,1 ,2 ,3 ,4 ,5
1, 1 1 1 1 1
2, 1 2 3 4 5
Matrix multiply
gt t(X) X
,1 ,2
1, 5 15
2, 15 55

43
Matrix algebra

Matrix inverse solve
gt solve(t(X) X)
,1 ,2
1, 1.1 -0.3
2, -0.3 0.1
Decompositions
Eigenvector (eigen)
Singular value (svd)
Cholesky (chol)

gt eigen(t(X)X) values 1 1.1831
0.0169 vectors ,1 ,2 1,
0.9637 0.2669 2, -0.2669 0.9637
44
Lists

An ordered collection of arbitrary objects
gt mylist lt- list(dat15, name"short",
locc(1.0,-2.5))
gt mylist
dat
1 1 2 3 4 5
name
1 "short"
loc
1 1.0 -2.5

gt mylistname 1 "short" gt mylist3 1 1.0
-2.5 gt names(mylist) 1 "dat" "name" "loc"
45
Getting help

Help on functions
?sample
help(sample)
General help Help menu
Function index, keyword search
On-line manuals (Users/Programmers/Statistics)
Visual demo

46
Stop!

To cancel the current command, press ESC
Cancel long outputs
Abort long computations
To quit from S-plus (command prompt)
gt q()
Or File Exit from menu bar

47
Day 1 Session 2

Practical
S-plus tutorial

48
Day 1 Session 3

Reading data and data import
Data frames and factors
ldframe objects
adding columns
sub-selections
summary
ssv/tsv

49
Reading in data

File Import Data From File
Choose file and file type from dialog

50
Data import options

Select the Options tab for import options
By default, ASCII files are assumed to have
column labels on row 1 (blank rows ignored)
data from row 2 onwards
columns separated by whitespace or comma
Change defaults with import options
First data line, separators, column spec
NB .dat files assumed to be Gauss files!

51
Data objects

When data are imported, the data are displayed in
a data window and a new data object is created
data frame
The name of the data object is chosen from the
file name, or can be specified in the import
dialog.

52
Data Frames

data.frame
usually used for data
Like a matrix with named columns
columns may be of different types
rows have labels

gt catalyst
Temp Conc Cat Yield
1 160 20 A 60
2 180 20 A 72
3 160 40 A 54
4 180 40 A 68
5 160 20 B 52
6 180 20 B 83
7 160 40 B 45
8 180 40 B 80

53
Data frame matrix?

You can operate on a data frame as if it were a
matrix
gt catalyst15,-3
Temp Conc Yield
1 160 20 60
2 180 20 72
3 160 40 54
4 180 40 68
5 160 20 52

But numeric functions usually dont work
gt mean(catalyst,-3)
Error in as.double Cannot coerce mode list to
double .Data list(..
Dumped
gt mean(
data.matrix(catalyst,-3))
1 22.42

54
Data frame list

A data frame is actually a list whose components
are all of equal length
gt names(catalyst)
1 "Temp" "Conc" "Cat" "Yield"
gt catalystConc
1 20 20 40 40 20 20 40 40
gt catalystnewcol lt- rnorm(8)

55
Data frame summary

gt summary(catalyst)
Temp Conc Cat
1604 204 A4
1804 404 B4
Yield newcol
Min. 45.0 Min. -0.9880
1st Qu.53.5 1st Qu.-0.3780
Median 64.0 Median -0.1610
Mean 64.2 Mean -0.0075
3rd Qu.74.0 3rd Qu. 0.2890
Max. 83.0 Max. 1.3700

56
Factors

A factor is a vector whose elements are
categories (levels)
Used for categorical data in modelling
To check class of object class(object)
gt f lt- as.factor(str) convert string to factor
gt str lt- as.character(f) convert factor to
string
gt val lt- as.numeric(as.character(f))
convert factor with numeric levels to numbers

57
Factor example

gt class(catalystConc)
1 "factor"
gt levels(catalystConc)
1 "20" "40"
gt catalystConc2 lt- 30
Warning messages
replacement values not all in levels(x) NA's
generated in "lt-.factor"(.A0, 2, value .A1)
gt catalystConc
1 2 3 4 5 6 7 8
20 NA 40 40 20 20 40 40

58
Longitudinal data

We have
m subjects, each with a subject id
subject ids are usually 1, 2, 3, , m
ni events for each subject i1,,m
responses, covariates
each event occurs at a particular time tij
We will use a rectangular data structure with one
row per event.

59
Rat bodyweight data

gt rats.ldf
y Time Subject group
1 57 1 1 Control
2 86 2 1 Control
3 114 3 1 Control
4 139 4 1 Control
5 172 5 1 Control
6 60 1 2 Control
51 59 1 11 Thyroxin
52 85 2 11 Thyroxin
134 104 4 27 Thiouracil
135 122 5 27 Thiouracil

60
ldframe objects

Like data frames, but required to have these two
columns (at least)
Time the time point of the event.
index number (1,2,3, )
days since start of study
weeks since calving
Subject the subject id code
identifies which events belong to which subject.
all observations on each subject must appear in a
contiguous block

61
ldframe object -- data

Other columns are used for data
responses
its best to use the first column for the most
important response
covariates
constant covariates (e.g. Sex) are simply
repeated for each event within each subject
These columns can have any (legal) name

62
Converting data.frame to ldframe

Convert data frame to ldframe with as.ldframe
my.ldf lt- as.ldframe(my.df)
If the data frame does not have Subject and/or
Time columns, specify them here.
names(Orthodont)
1 "Subject" "Sex" "age" "distance"
Orthodont.ldf lt- as.ldframe(Orthodont,Timeage)

63
Two-way indexing

, indexing as for matrix or data.frame
gt rats.ldf12,
y Time Subject group
1 57 1 1 Control
2 86 2 1 Control
gt mean(rats.ldf,"y")
1 100.8
gt rats.ldf123,"group"
1 Thiouracil
Levels
1 "Control" "Thiouracil" "Thyroxin"

64
Column operations

Select columns using list () notation
gt gp lt- rats.ldfgroup
gt table(gp)
Control Thiouracil Thyroxin
50 50 35
Add or delete columns using notation
gt rats.ldfbodymass lt- newdata
gt rats.ldfy lt- NULL

65
Examples

First 3 observations from all subjects
gt rats.ldfrats.ldfTimelt3,
All from the Thyroxin group
gt rats.ldfrats.ldfgroupThyroxin,
Large responses from the Control group
gt rats.ldfrats.ldfgroup"Control"
rats.ldfygt170,
y Time Subject group
5 172 5 1 Control
10 177 5 2 Control
15 185 5 3 Control

66
Occurrences -- nths

nths returns an index vector to nth within-block
occurrences in a block-sorted vector
nths(n, vec)
vec is the vector to index
n1 first occurrences
n13 first three
n-1 last occurrence
nc(1,-1) first and last

67
Example

gt vec
1 a a a b b b b b c c c c c c d d
gt nths(1,vec)
1 1 4 9 15
gt nths(13,vec)
1 1 2 3 4 5 6 9 10 11 15 16
gt nths(-1,vec)
1 3 8 14 16
gt nths(c(1,-1),vec)
1 1 3 4 8 9 14 15 16

4
9
15
68
Practical example

gt idx lt- nths(c(1,-1),rats.ldfSubject)
gt rats.ldfidx,
y Time Subject group
1 57 1 1 Control
5 172 5 1 Control
6 60 1 2 Control
10 177 5 2 Control
11 52 1 3 Control
15 185 5 3 Control
16 49 1 4 Control
20 164 5 4 Control
...

69
Selecting by subject

Use a single index to select all events for
selected subject(s)
gt rats.ldf1
y Time Subject group
1 57 1 1 Control
2 86 2 1 Control
3 114 3 1 Control
4 139 4 1 Control
5 172 5 1 Control

70
More than one subject

gt rats.ldfc(10,27)
y Time Subject group
46 57 1 10 Control
47 82 2 10 Control
48 110 3 10 Control
49 139 4 10 Control
50 169 5 10 Control
131 53 1 27 Thiouracil
132 72 2 27 Thiouracil
133 89 3 27 Thiouracil
134 104 4 27 Thiouracil
135 122 5 27 Thiouracil

71
Selecting by subject id

To select by subject id, use quotes
gt rats.ldf.rev lt- rats.ldf271
gt rats.ldf.rev1,
y Time Subject group
131 53 1 27 Thiouracil
gt rats.ldf.rev"15"12,
y Time Subject group
71 57 1 15 Thyroxin
72 72 2 15 Thyroxin

Note row labels stay with the correct event even
after the data has been reordered.
72
Bad ordering

Selecting so that subjects arent in blocks
anymore
gt rats.perm lt- rats.ldfsample(1135),
Warning messages
Selection resulted in invalid ldframe object
converted to data.frame in ".ldframe\
"(rats.ldf, sample(1135), )
gt rats.perm13,
y Time Subject group
30 153 5 6 Control
8 123 3 2 Control
29 131 4 6 Control
gt rats.perm lt- rats.ldfsample(127)

73
Replication

Some variables may be subject-specific
Constant (replicated) within a subject id
e.g. Sex, Treatment Group
Some variables may be time-specific
Constant (replicated) within a time point
e.g. daily temperature, regression dummy variables

74
ldframe summary

gt summary(rats.ldf)
Longitudinal data frame (ldframe) object
Number of subjects 27
Number of time points 5
Total events 135
Distribution of number of observations per
Subject Time
527 275

Subject-specific variables
group
Control 10
Thiouracil10
Thyroxin 7
Time-specific variables none
Non-replicated variables
y
Min. 46
1st Qu. 71
Median 100
Mean 101
3rd Qu.122
Max. 189

75
ssv

ssv(ldframe)returns all subject-specific columns,
condensed into one-per-subject
gt ssv(rats.ldf)
Subject group
1 1 Control
17 17 Thyroxin
18 18 Thiouracil
27 27 Thiouracil

76
selective ssv

To select a specific subject-specific column
gt gps lt- ssv(rats.ldf,"group")
gt data.class(gps)
1 "data.frame"
gt table(gpsgroup)
Control Thiouracil Thyroxin
10 10 7
gt ssv(rats.ldf,"y")
NULL
Warning messages
Variable y exists, but is not
subject-specific in ssv(rats.ldf, "y")

77
Adding columns with ssv

ssv(obj,"sex") lt- dat
gt sex lt- rep(c("m","f"),c(10,17))
gt ssv(rats.ldf,"sex") lt- sex

ldframe object
new col name
vector as long as number of subjects in obj
78
ssv results

gt rats.ldfc(1,27)
y Time Subject group sex
1 57 1 1 Control m
2 86 2 1 Control m
3 114 3 1 Control m
4 139 4 1 Control m
5 172 5 1 Control m
131 53 1 27 Thiouracil f
132 72 2 27 Thiouracil f
133 89 3 27 Thiouracil f
134 104 4 27 Thiouracil f
135 122 5 27 Thiouracil f

79
Another example

gt span lt- function(x) xlength(x)-x1
gt growth lt- tapply(rats.ldfy, rats.ldfSubject,
span)
gt growth
1 2 3 4 5 6 7 8 9 10 11 12
115 117 133 115 95 107 90 91 91 112 122 84
13 14 15 16 17 18 19 20 21 22 23 24 25 26
133 118 87 88 119 68 63 80 63 89 68 52 82 61
27
69
gt ssv(rats.ldf,"growth") lt- growth

80
Time-specific variables

tsv is like ssv for time-specific variables
varies within subject
constant for each time point
Sometimes data are unbalanced -- each subject
measured at different times
in this case tsv requires a value for each unique
time point

81
tsv example

gt tsv(rats.ldf, "temp") lt- c(21,20,25,28,35)
gt rats.ldf1011
y Time Subject group growth temp
46 57 1 10 Control 112 21
47 82 2 10 Control 112 20
48 110 3 10 Control 112 25
49 139 4 10 Control 112 28
50 169 5 10 Control 112 35
51 59 1 11 Thyroxin 122 21
52 85 2 11 Thyroxin 122 20
53 121 3 11 Thyroxin 122 25
54 146 4 11 Thyroxin 122 28
55 181 5 11 Thyroxin 122 35

82
Help Files

data.frame.object
read.table
ldframe
as.ldframe
ldframe.object
nths
summary.ldframe
ssv
tsv

83
Day 1 Session 4

Practical
Read in data
get summaries
create subsets of data
using tsv/ssv
complex subsetting -- nths

84
Day 2

Exploring longitudinal data

85
Day 2 Session 1

Plotting
Smoothing
Cross-validation
Object-orientation

86
Plotting longitudinal data

Goals
Show how response varies with time
Show variation within subject across time
Show variation between subjects
Show variation between groups (covariates)
Highlight unusual observations missing data
Problem too much data!

87
Scatterplot?

Scatterplot of yij versus tij

88
Basic parallel plot

Create a scatterplot of yij versus tij
Connect observations within each subject i with
lines
"traces"
Shows overall variation
Allows observing within-subject behaviour by
following lines

89
Basic plot

The standard S-plus plot function creates a
parallel plot for ldframe objects
gt plot(rats.ldf)
NB plots created from the command line cannot
(easily) be modified using the GUI

90
Milk Protein Data

79 cows
Three treatment levels (group)
Barley diet
Lupins diet
Mixed (mixture of barley and lupins)
Response y milk protein content
Time is weeks since calving (up to 19)
Missing data

91
milk.ldf

gt idx lt- nths(c(12,-(12)),milk.ldfSubject)
gt milk.ldfidx,c(78,4)
y Time Subject group
1464 3.67 1 78 Lupins
1465 3.26 2 78 Lupins
1481 3.35 18 78 Lupins
1482 3.09 19 78 Lupins
58 3.66 1 4 Barley
59 3.50 2 4 Barley
75 3.28 18 4 Barley
76 NA 19 4 Barley
gt plot(milk.ldf)

92
Highlighting by group

Subject-specific columns can be used to
differentiate subjects by colour, line style, or
plotting character.
gt plot(milk.ldf, colorgroup)
gt plot(milk.ldf, linegroup)
gt plot(milk.ldf, pchargroup,
generallist(type"b"))
Combine color line pchar for 2 or 3-way
discrimination.

93
Milk data plot

gt plot(milk.ldf, colorgroup)
Colours chosen according to levels of group
Missing data highlighted
Dashed lines for intermittent missing values
Red crosses for dropouts

94
Adding a legend

Use the legend argument to add a legend to the
plot
gt plot(milk.ldf, colorgroup, legendlocator(1))
The legend is created automatically using
information from the plot specification.

95
Shadow plots

The previous plot is very "busy"
difficult to follow individual trails
difficult to distinguish between-group behaviour
Idea plot most traces as light "shadows" to
reduce clutter but maintain impression of
variation.

96
Shadow plot

The n.full argument specifies the number of
traces to plot in full
chosen at random
the remainder as shadows
use number between 0 and 1 for percentage
plot(milk.ldf, colorgroup,
n.full0.15)

97
Selector function

When n.full is used, the full traces are selected
at random.
Alternative select an "even spread" according to
some statistic of the trace
The selector argument may be used to specify a
function to calculate the statistic
plot(milk.ldf, n.full12, colorgroup,
selectorfunction(y) y1)

98
Selector notes

The traces with the minimum and maximum
statistics are always chosen
If the selector function returns NA, that trace
will never be chosen
So be careful when using functions like mean as
the selector function
If there are ties in the statistics, traces are
randomly selected within ties

99
Example selector functions

selector
function(y) mean(y, na.rmT)
function(y) max(y, na.rmT)
function(y) diff(range(y, na.rmT))
function(y) sum(!is.na(y))
function(y) rev(y!is.na(y))1

100
Overlaying plots

When used with ldframe objects, lines and points
add to the current plot
lines -- parallel plot
points -- scatterplot
plot(milk.ldf, n.full0)
lines(milk.ldfc(1,27,76),
colorgroup)

101
Different response

By default, plot.ldframe uses the first column of
the ldframe as the response
To use a different response, supply an expression
as the second argument
You can use any of the column names in the
expression
plot(rats.ldf, log(y), colorgroup)
plot(my.ldf, var2)

102
Different x variable

By default, plot.ldframe uses the Time column as
the x variable
To use a different x variable supply an
expression for argument 3 (x.expr)
plot(rats.ldf, x.exprlog(Time))
plot(cog.ldf, x.exprAge)
plot(mydat, x.exprjitter(Time))

103
Customization

There are many optional arguments to plot.ldframe
to change the appearance of the plot.
These can be included in the call to plot, or use
tsopts to make permanent changes
See ?tsopts for a complete list
plot(milk.ldf, colorgroup, colsc(8,5,6))
tsopts(colsc(8,5,6))

104
general

A list of options to be passed to the standard
plot function.
generallist(xlab"Hours")
generallist(type"o", cex2)
generallist(main"Rat Data")
gt plot(milk.ldf, generallist(type"o", pch""))
See ?plot, ?par

105
full

A list of graphics parameters to use when
plotting traces in full (see ?par)
fulllist(lwd2)
fulllist(lty3, pch"o")
fulllist(col6)
gt plot(rats.ldf, fulllist(lwd2, col3))
Also shadows for shadow traces.

106
cols, pchs, ltys

Used when the color, line, or pchar arguments are
used in plot
cols colours to cycle through
pchs plotting characters
ltys line types
tsopts(pchs115)
plot(rats.ldf, pchargroup,
generallist(type"b"), legendlocator(1))

107
Highlighting missing values

miss.style graphics parameters to use when
drawing a line between two points in a trace
separated by missing value(s)
dropout.style graphics parameters to use when
plotting the last point in a trace terminated by
missing value(s). Also dropin.style
tsopts(dropout.styleNULL)

108
Mean vs time

How best to observe changes in sample mean over
time?
Possibly within several treatment groups
Within-time-point means?
What if data are unbalanced?
What if there is too much variability?
What about correlation?

109
Kernel Smoothing

Assume a single group for the moment.
Basic idea estimate mean at time t using a
weighted average of points on all subjects
observed near t
Bigger weights nearer to t
Weights defined by a kernel function K(t)

110
Kernel function
K(t)
h
t

The "width" of the kernel function is specified
by a bandwidth h

111
Kernel estimation

Model yi(t) µ(t) ?i(t)
Data (yij, tij)
weight for yij w(t, tij, h) K((t-tij)/h)

112
ksmooth.lda

Simplest specify ldframe and bandwidth
sm1 lt- ksmooth.lda(milk.ldf, band3.5)
plot(sm1)
Uses a Normal kernel by default
Returns an ldframe with a single subject
smooth calculated at all unique time points by
default

113
The bandwidth

If the bandwidth is large, the kernel estimate
will be very smooth
A small bandwidth will reflect local behaviour,
and be "rougher"
A zero bandwidth will calculate the average of
any points which exists at the time point.

114
Bandwidth example

Pass a vector to bandwidth to calculate many
smooths at once.
sm2 lt- ksmooth.lda(milk.ldf,
bandc(0,2,3.5, 10, 100))
plot(sm2, Smooth1, fulllist(col1))
lines(sm2, Smooth2, fulllist(col2))
lines(sm2, Smooth3, fulllist(col3))
lines(sm2, Smooth4, fulllist(col4))
lines(sm2, Smooth5, fulllist(col5))

115
More points

The kernel smooth can be calculated at any time
point
Specify time points to calculate at with the
x.points argument
sm3 lt-ksmooth.lda(milk.ldf, band3.5,
x.pointsseq(-1,22,length1000))
plot(sm3)

116
Smoothing within groups

Often we use smoothing to look at differences
between treatment groups
Specify the groups with the by argument
A factor variable classifying the data points
into subgroups (eg treatment group)
The result ldframe will have as many traces as
there are levels in by

117
Within-groups example

Calculate smooths within each of the three diets
for the milk protein data, and plot the smooths
with the raw data.
sm4 lt- ksmooth.lda(milk.ldf,
band3.0, bygroup)
plot(milk.ldf, n.full0)
lines(sm4, colorSubject,
legendlocator(1))

118
Changing the kernel function

You can specify any kernel function with the
kernel argument
A scalar valued function of one argument
Default kernels kernel.knormal,
ksmooth.kbox,ksmooth.ktriangle.
All have quartiles at 0.25
The choice of kernel function isn't nearly as
important as the choice of bandwidth

119
Kernel example

Create a new kernel function and plot it
mykernel lt- function(u)
ifelse(abs(u)lt1, 1, 1/abs(u))
x lt- seq(-2,2,length100)
plot(x,mykernel(x),type"l")
Use it on the milk protein data
plot(ksmooth.lda(milk.ldf, bygroup,
band0.5, kernelmykernel))

120
Selecting the bandwidth

The Rice-Silverman Criterion
Set bandwith at h
Remove subject i from the data
Calculate the kernel smooth
Add up the squared differences between subject i
and the kernel smooth
Sum over all subjects
Minimise over h to determine bandwidth

121
ksmooth.rsc

ksmooth.rsc works like ksmooth.lda, but returns
the criterion value instead
gt ksmooth.rsc(milk.ldf, band1.5)
bandwidth RSC
1, 1.5 132.3614
gt rsc lt- ksmooth.rsc(milk.ldf,
bandseq(0,4,length30))
gt plot(rsc)

122
Grouped RSC

If you use the by argument with ksmooth.rsc, you
get the sum of the criterion values within the
subgroups.
rsc lt- ksmooth.rsc(milk.ldf, bygroup,
bandseq(0,2,length30))
plot(rsc)

123
For more information

?plot.ldframe
?tsopts
?ksmooth.lda
?ksmooth.rsc
Why plot.ldframe?

124
Object-oriented

S-plus is object oriented
Some objects have a class which identifies the
type of object.
gt class(milk.ldf)
1 "ldframe" "data.frame"
milk.ldf is of class ldframe, which inherits from
class data.frame

125
Method functions

Some "generic" functions behave differently
depending on the class of the first argument
plot, summary, print, operators,
You call plot(milk.ldf)
S runs plot.ldframe(milk.ldf)

126
Day 2 Session 2

Practical -- plotting and smoothing

127
Day 2 Session 3

Model specification
OLS models -- lm
Missing data issues -- na.add

128
Linear Models

Basic Model
Y X? ?

129
Models in S-plus

formula notation
response predictor predictor
response vector
predictors vector, factor
Intercept term (1) automatically included unless
specifically excluded

130
Model formulae
131
Y 1
Response
Time
132
Y Time
Response
Time
133
Y Time Time2
Response
Time
134
Y poly(Time,3)
Response
Time
135
Y group Time
Response
Time
136
Y groupTime
Response
Time
137
More formula examples

y group - 1
y group - 1 Time
log(y) sex group/Time
log(y) sex group/Time - group
y poly(Time,3) sexgroup
y group/ns(Time,4)
(4 df natural splines within each group -- see
also bs)

138
Factor contrasts

By default, S-PLUS uses Helmert contrasts
options(
contrastsc("contr.treatment","contr.poly"))
Intercept -- level 1 estimate
Factor g estimate -- Level g - Level 1

139
Ordinary Least Squares

Although we know there is within-subject
correlation, sometimes useful to ignore it
The clever ostrich
Point estimates of parameters consistent and
usually fairly efficient
Standard errors can be very wrong (small)
but possible to get better estimates

140
Fitting OLS models lm

Basic usage
obj lt- lm(formula, datadataframe)
formula is a model formula
dataframe is the name of a data frame in which to
find the formula variables
obj is an object to store the results

141
Model results

You can print or extract various components of
the result object
obj brief details
summary(obj) more details
coef(obj) estimated coefficients
anova(obj) AOV table
fitted(obj) fitted values
resid(obj) residuals
plot(obj) diagnostic plot

142
lm example

fit lt- lm(log(y) group
poly(Time,2), datarats.ldf)
fit
coef(fit)
summary(fit)
anova(fit)
plot(rats.ldf, fitted(fit))
plot(rats.ldf, resid(fit))
plot(fit)

143
Model comparison

You can compare two model objects using
anova(obj1,obj2)
for lm this gives an AOV table with an F test for
each term (not parameter)
fit2 lt- lm(log(y)
group poly(Time,3), datarats.ldf)
anova(fit1,fit2)

144
Prediction

predict(obj) return fitted values
predict(obj,newdata) make predictions
newdata is a data frame with new covariate values
to predict at
WARNING predict gives wrong answers when the
parameterisation depends on the covariate values
(eg poly)
use predict.gam instead (safe prediction)

145
predict example

newdata lt- ldframe(Subjectrep(13,rep(100,3)))
tsv(newdata,"T2") lt- seq(1,6,length100)
newdataTime lt- newdataT2
ssv(newdata,"group") lt- levels(rats.ldfgroup)
plot(rats.ldf, log(y),
generallist(xlimc(1,6)))
lines(newdata, predict.gam(fit2,newdata),
colorSubject)
lines(newdata, predict(fit2,newdata),
colorSubject) wrong!!

146
coef example

Lets look at variation at residual slope and
intercept across subjects
resfit lt- lm(resid(fit)
Subject/Time - 1, datarats.ldf)
co lt- coef(resfit)
hist(co127)
hist(co-(127))

147
Missing data

If the response or any covariates contain NAs, lm
will fail with an error message
gt lm(y group, datamilk.ldf)
Error in function(object, ...) missing values
not allowed found in y
Dumped
Include the option na.actionna.omit to delete
all rows with missing values before doing
calculations

148
na.omit example

gt fit3lt-lm(ygroup, datamilk.ldf,
na.actionna.omit)
gt fit3
Call
lm(formula y group, data milk.ldf,
na.action na.omit)
Coefficients
(Intercept) group1 group2
3.42467 -0.1097727 0.002512743
Degrees of freedom 1337 total 1334 residual
Dropped 164 cases due to missing values
Residual standard error 0.3198013

149
na.omit a problem

na.omit works by temporarily deleting rows with
missing data from the data frame
lm thinks there are fewer rows than there are
resid and fitted return vectors as long as the
number of non-missing rows
gt plot(milk.ldfy, resid(fit3))
Error in plot Lengths of x and y must match
gt length(milk.ldfy)
1 1501
gt length(resid(fit3))
1 1337

150
na.add a solution

You can use na.add to add a shortened vector to
an ldframe, expanding it by inserting missing
values where needed.
na.add(milk.ldf, resid) lt- resid(fit3)
milk.ldf41719,
y Time Subject group resid
74 3.35 17 4 Barley -0.1819294
75 3.28 18 4 Barley -0.2519294
76 NA 19 4 Barley NA

151
na.add example

Milk data 4df splines within each treatment
group.
a lt-lm(y group/ns(Time,4), datamilk.ldf,
na.actionna.omit)
af lt- fitted(a)
na.add(milk.ldf, "spline") lt- af
plot(milk.ldf, spline, colorgroup)

152
na.omit an alternative

Alternatively, you can simply use na.omit to
remove all rows with missing data from your data
before you begin.
milk.nona.ldf lt- na.omit(milk.ldf)
Not always appropriate
Oswald functions work best with missing data left
in.

153
Day 2 Session 4

Practical
Fitting models
Getting coefficients
Making predictions
Robust SEs

154
Day 3

Analysing Longitudinal Data

155
Day 3 Session 1

Variogram
Parametric correlation models
Likelihood ratio testing
Structured random effects

156
The variogram

Assume our data are replications of a stationary
process Y(t)
The variogram is used to explore the correlation
structure

157
Variogram

When u is small
?(u)gt0 -- Measurement error
When u is large
if ?(u)ltlt?2, no independence even at large lags
random intercept

158
Creating stationarity

Before we can calculate the variogram, need to
de-trend the data
Calculate the most complex model imaginable, and
take residuals
For balanced data, residuals from saturated
treatmenttime model
For unbalanced data, residuals from kernel smooth
by treatment group

159
Stationarity -- balanced data

Kernel smoothing with zero bandwidth calculates
means within group at each time
ksmooth.res returns residuals
res lt- ksmooth.res(milk.ldf, bygroup, band0)
plot(res, n.full10, colorssv(milk.ldf)group)
Also works with "slightly" unbalanced data

160
Stationarity -- unbalanced data

Choose the most complex classification
Choose bandwidth with RSC
Kernel smooth and take residuals with ksmooth.res

161
Sample variogram

Take stationary residuals rij
Calculate vijk½(rij- rik)2, uijktij- tik
Variogram estimate v(u) at lag u is average of
all vijk where uijk is near u
"Near" means within a binwidth eps
Also estimate process variance ?2
average ½(rij- rlk)2 for i? l

162
Calculating the variogram

res lt- ksmooth.res(milk.ldf,
bygroup, band0)
vg lt- variogram(res)
plot(vg)
The variogram is calculated at all unique lags at
least 1 binwidth apart
Change the binwidth with the eps argument

163
Variogram cloud

You can also calculate and plot all the vijk's by
including the cloudT argument
vg2 lt- variogram(res, cloudT)
plot(vg2)
The plot includes the cloud as a scatterplot
Warning there can be many cloud points, which is
why the cloud is not returned by default

164
Variogram interpretation

Intercept
measurement error
Process variance - stable value at large lags
random intercept
if it doesnt stabilise, not stationary!
Rate of increase to stable value
serial correlation

165
Mixed Effects Modelling

Sometimes, it makes sense to include random
coefficients in our model
random intercept
random slope
This is useful in many ways
more natural models
more efficient parameterisation
explains correlation

166
Mixed effects examples

Random intercept
Yij (? Ui) ?tij ?ij
Random slope
Yij (? Ui) (? Si) tij ?ij
Multilevel effects
Yij (? Ui) (? Si) tij Gg ?ij

-- no can do!
167
Structured random effects

Y X?Z? ?
Random effects ? Var(?)G
Serial correlation ? Var(?)R
Random effects zero-mean, Normal
include fixed effects for nonzero mean
? realised independently for each subject
2-level model (multilevel gt 2 not possible)

168
Basic lme usage

How to specify, Z, G, R?
lme(fixed, random, cluster, data)
fixed linear model (X)
random RE model (Z)
cluster independent units
usually cluster Subject
data ldframe (or dataframe)
Other options for specifying G, R, ML/REML,
(see ?lme.formula)

169
Random effect design

Specify random model formula for Z
random 1 random intercept
random Time random slope intercept
NB no LHS to formula!
By default, all fixed components included in
random model.
All random effects assumed correlated
general (unstructured) form for G

170
lme -- rats

NB lme does not allow within-formula
transformations
rats.ldflogy lt- log(rats.ldfy)
Parallel quadratics, random intercept
lme(logy group poly(Time,2), 1,
clusterSubject, datarats.ldf)

171
lme -- orthodontic data

Orthodont.plot()
orth1 lt- lme(fixed distance age,
random age, cluster Subject,
data Orthodont)
orth1
summary(orth1)

172
Predictions

Population predictions
assuming ?0 (i.e. EY)
Cluster predictions
EYi Zmode(?ik Y)
predict(obj, newdata, cluster)
predict(orth1, Orthodont, Subject)

173
Diagnostic plots

lme provides 3 types of diagnostic plot
scatterplot matrix of predicted REs
plot(orth1,optionc)
Residual plots
plot(orth1, optionr)
response vs fitted values
residuals vs fitted values
residuals by cluster
Standardised residual plots
plot(orth1, options)

174
RE covariance structures

By default, all between-random-effect covariances
are estimated independently
unstructured G
With many random effects, a simpler structure may
be preferred.

175
re.structure

re.structureunstructured
independently estimate all correlations (default)
re.structurecompsymm (compound symmetry)
all variances and covariances equal
re.structurediagonal
independent, different variances
re.structureidentity
independent, common variance
re.structurear1
common variance, AR(1) correlation

176
More complex structures

re.block can be used to group random effects into
independent blocks
re.block list(c(1,2),3)
re.structure is given as a vector for the
strucure of each block
re.structurec(compsymm,identity)
Or, you can design your own structures by writing
an S function

177
Serial correlation

Structure for Var(?i) (and therefore R)
serial.structure
identity (constant variance, default)
ar1 (AR(1), unit spacing)
compsymm
ar2, ma1, ma2, arma11
DIY
Dont overparameterise!!

178
Balanced or unbalanced?

Random effects model
no problem here
Serial correlation model
identity, compsymm OK
ar1 etc. assume unit time spacing by default
lme(, serial.structurear1,
serial.covariate Time, serial.covariate.trans
formationnone)

179
Diggle 3-component

Measurement error variance ?2
Random intercept variance ?2
Random process Wi(t) SGP(0, ?2?(?t ))
?(?t ) ??t, correlation function
continuous-time AR(1)

180
Diggle 3-component example

Fit D3C model (almost) to rats data
lme(logy group poly(Time,2),
random 1, cluster Subject,
serial.structure"ar1", datarats.ldf)
Compare variance estimates with variogram
plot(variogram(ksmooth.res(rats.ldf,
log(y), bygroup, band0)))

181
REML or ML

ML estimation
est.methodML
REML estimation (default)
est.methodRML
Model comparison (LR test) with REML
only when fixed models are the same!

182
Model testing

Use anova to compare models
obj1 lt- lme(logy grouppoly(Time,2),
1, Subject, datarats.ldf)
obj2 lt- lme(logy grouppoly(Time,2),
1Time, Subject, re.struct"diag",
datarats.ldf)
anova(obj1,obj2)

183
Simplifying the variance model

If you give a variance parameter a zero initial
estimate, it will be fixed at zero
Useful for testing whether a certain variance
component is necessary in the model
Use a Likelihood Ratio test
Use CS with covariance0 for ID

184
Parameter estimation a warning

It can be difficult to get reliable variance
parameter estimates from mixed models
flat likelihoods (and no standard errors!)
correlated parameters (and no est. correlations!)
easily overparameterised (with no warning!)
Restart your model with different initial
estimates to check convergence
start list(D matrix, alpha c())

185
Day 3 Session 2