Title: Oswald
1Oswald
- for Windows
- Lancaster University 7-9 June 1998
MathSoft International Knightway House Park
Street Bagshot, Surrey GU19 5AQ United Kingdom
David Smith S-PLUS Product Manager Phone 44
(0)1276 452299 ext. 212 Fax 44 (0)1276
451224 E-mail dsmith_at_mathsoft.co.uk
2Outline - Day 1
- 0915-1045 Intro to S-plus and Oswald
- 1045-1115 coffee
- 1115-1245 hands-on session
- 1245-1400 lunch
- 1400-1530 Longitudinal data
- 1530-1600 afternoon tea
- 1600-1730 hands-on session
3Outline - Day 2
- 0915-1045 Exploring longitudinal data
- 1045-1115 coffee
- 1115-1245 hands-on session
- 1245-1400 lunch
- 1400-1530 Basic modelling
- 1530-1600 afternoon tea
- 1600-1730 hands-on session
4Outline - Day 3
- 0915-1045 Advanced modelling
- 1045-1115 coffee
- 1115-1245 hands-on session
- 1245-1400 lunch
- 1400-1530 Other topics
- 1530-1600 afternoon tea
- 1600-1730 free laboratory session
5Day 1
- Introduction to S-plus and Oswald
6Day 1 Session 1
- What is S-PLUS?
- Oswald overview
- Starting S-PLUS for Windows
- The S command language
- Fundamentals -- expressions and assignments
- Vectors, matrices and lists
- Calling functions
7S-PLUS History
- 1980s S Language developed by ATT Bell Labs
- Added value S-plus distributed by StatSci Inc
- 1992 Mathsoft acquires Statsci, Axum
- 1997 S-plus 4.0 released
- Axum Graphics, S Language, Office GUI
- 1998 S-plus 4.5 (Windows) S-plus 5.0 (Unix)
released - Best-regarded statistical research tool in the
industry - 50,000 users world-wide
8S-PLUS 4.5For exploratory data analysis and
statistical modelling
- From basic statistics to cutting edge statistics
- Customisable and extensible GUI
- For data analysts and researchers of all levels
- Powerful, object-oriented S language
- With thousands of built-in functions
- Total control of data and graphics
- Worldwide S-PLUS users group
9Book Recommendations
- Modern Applied Statistics with S-plus
- W.N. Venables and B.D. Ripley
- Springer 1997 (second edition)
- The New S Language A programming environment for
data analysis and graphics - Becker, Chambers Wilks
- Wadsworth Brooks/Cole 1988
- Analysis of Longitudinal Data
- Diggle, Liang Zeger
- Oxford University Press 1994
10Oswald - Introduction
- Object-oriented Software for the Analysis of
Longitudinal Data
11Longitudinal Data
- A (largish) number of subjects
- people, households, rats, cows,
- A measurement taken repeatedly on each subjects
over a (smallish) number of times - Daily blood-pressure on 65 patients taken over 2
weeks. 30 patients on control drug, 35 on new
anti-hypertensive drug.
12Oswald features
- New data type for longitudinal data
- Subsample selection
- Creating replicated covariates
- Parallel plots
- Shadow plots
- Available for Windows and Unix
- Analysis of longitudinal data
- Kernel smoothing
- Rice-Silverman C-V
- Variogram
- Mixed modelling
- Dropout modelling
- Comes with GEE and ALR
13Using Oswald
- From the S-plus prompt
- library(oswald)
- The Oswald function and datasets are now
available.
14Starting S-PLUS
15(No Transcript)
16Object Browser
- S-plus is object-oriented
- data, functions, analysis results, plots, ...
- persistence (press DEL to delete)
- The Object Browser organises all the objects you
create - Left pane -- objects by classification
- Right pane -- list/components of objects
- Completely customisable
17Getting Data
- Data Select Data
- Select existing data, or import data from a file
- Dialog appears at startup (by default)
- Import data (later)
- Existing data enter the name of an S-plus object
in the Name box and click OK.
18Data Windows
- Data appears in a data window
- Like a spreadsheet, but column-oriented
- Includes column names and row labels
- Click on a column name to select a column
- Shift-click to select a block of columns
- Ctrl-click for non-adjacent columns
- Order of selection is important!
- Selected columns used for plots, analysis, ...
19Simple graphs
- Open the 2D or 3D plot palette
- Select column(s) for plotting from data window
- First ctrl-click -- x axis
- Second ctrl-click -- y axis
- (Third ctrl-click -- z axis)
- Click on the palette button corresponding to the
desired plot
20Trellis Graphs
- You can condition any plot by the value of any
other variable in the data - Create a plot
- Select the conditioning column in the data
window, and drag to the title area - Be sure to start dragging from the data
- Cursor changes to when ready to drop
- Plot is redrawn with panels for subranges of the
conditioning variable
21Example Ozone data
- Select Data
- Histogram of Ozone
- Scatterplot Temp vs Ozone
- With loess line
- Trellis conditioning on Wind
- 3-D plot Temp vs Wind vs Ozone
- Vary colour by Radiation
- Export graphics
22The Command Window
- Behind all of the menus and toolbars lies the S
command language - A complete programming language
- expressions, data types, loops, functions,
- object oriented
- We will mostly use Oswald by entering commands in
the S language using the Command Window - Click on to open.
23S Language absolute basics
- Enter an expression at the prompt gt
- S-plus prints out the result (usually)
- gt 11
- 1 2
- gt pi
- 1 3.141593
- gt sin((1sqrt(5))/2pi)
- 1 -0.9320324
24Secondary prompt
- If you fail to complete the expression, continue
at the prompt - gt exp(1/sqrt(2) -
- 0.5)
- 1 1.230114
- If you get stuck, enter lots of )s
- gt log(1sqrt(5 sin(2)
- )))))
- Syntax error No opening parenthesis, before ")"
at this point - log(1sqrt(5 sin(2)
25Vectors
- Many functions return vectors
- gt rnorm(6)
- 1 -1.4217905 0.7963978 -0.2487190 -0.8279492
- 5 0.7495827 1.0976972
- The n on the left shows where the row starts.
A single number is a length 1 vector - gt mean(rpois(50, mean5))
- 1 4.74
26Making vectors
- Sequences ab
- gt 110
- 1 1 2 3 4 5 6 7 8 9 10
- gt 5045
- 1 50 49 48 47 46 45
- Concatenation c(v1,v2,v3,)
- gt c(6,2,1)
- 1 6 2 1
- gt c(15,51)
- 1 1 2 3 4 5 5 4 3 2 1
27Assignment
- Store results in objects with lt- (gets)
- gt a lt- 110
- The result is not printed until you enter the
object name by itself - gt a
- 1 1 2 3 4 5 6 7 8 9 10
28Object Names
- Valid object names may contain only
- Letters abcXYZ
- Numbers 0123456789
- Dot .
- Valid names
- y Y weight x.var LD50 .my.dat sum
- Invalid names
- m-1 heart_rate 9th T F NA
29Handling objects
- List all your objects
- gt objects()
- 1 ".Last.value" ".Random.seed"
- 3 ".ldats.options" "a
- Objects are persistent and remain until removed
(even if you quit S-plus) - gt remove("a")
- gt a
- Error Object "a" not found
- Or, use the Object Browser
30Objects as variables
- Objects can be used in expressions
- gt a lt- 15
- gt sum(a)
- 1 15
- gt b lt- c(a, 10)
- gt length(b)
- 1 6
- gt 2b
- 1 2 4 6 8 10 20
31Vector arithmetic
- Scalar functions work elementwise
- gt a lt- 14
- gt sqrt(a)
- 1 1.000000 1.414214 1.732051 2.000000
- Scalar and vector arithmetic
- gt 2a
- 1 2 4 6 8
- gt 2a log(a)
- 1 2.000000 4.693147 7.098612 9.386294
32Logical vectors
- Expressions with relational operators return
logical vectors - ab agt5 blt0 a!0
- T is True, F is False
- gt a lt- rnorm(5)
- gt a
- 1 -0.1495632 0.1389647 2.3571278 2.1495335
- 5 -1.8157597
- gt a lt 0
- 1 T F F F T
33Missing values -- NA
- The missing value is represented by NA
- gt a lt- c(1, NA, 2)
- Most operations on NA return NA
- gt a 1
- 1 2 NA 3
- gt sum(a)
- 1 NA
- is.na checks for missing values
- gt is.na(a)
- 1 F T F
Not aNA
34Vector indexing
- Use to select elements of a vector
- gt a lt- c(2,3,5,7,11,13,17)
- gt a1
- 1 2
- gt a35
- 1 5 7 11
- gt ac(1,3,5)
- 1 2 5 11
- Negative indices remove elements
- gt a-(13)
- 1 7 11 13 17
35Logical indices
- A logical index selects elements
- index as long as the indexed vector
- gt b lt- rnorm(8)
- gt bblt0
- 1 -0.04100 -0.86600 -0.07229
- gt log.b lt- log(b)
- Warning messages
- NAs generated in log(x)
- gt bis.na(log.b)
- 1 -0.04100 -0.86600 -0.07229
- gt log.b!is.na(log.b)
- 1 0.7242 0.9154 0.6976 -0.1279 0.2408
36Replacement
- You can also use on the left hand side of an
assignment - gt a lt- sample(110)
- gt a
- 1 2 1 10 9 7 8 3 6 4 5
- gt a5 lt- NA
- gt a
- 1 2 1 10 9 NA 8 3 6 4 5
- gt ais.na(a) lt- 0
- gt a
- 1 2 1 10 9 0 8 3 6 4 5
37Calling functions
- Functions are called like this
- function.name(argument, argument, ...)
- Functions always return a value
- NULL represents no value
- Function arguments have a
- position (first, second, )
- name
- default value (sometimes)
38Example rep
- rep(x, times inferred, length.out
inferred) - gt rep(13,2)
- 1 1 2 3 1 2 3
- gt rep(13,length8)
- 1 1 2 3 1 2 3 1 2
- gt rep(13,c(3,2,1))
- 1 1 1 1 2 2 3
- gt rep()
- Error in rep Argument "x" is missing, with no
- default rep()
- Dumped
39Example seq
- seq(from1, toend, by1,
- lengthinferred, alongNULL)
- gt seq(1,5)
- 1 1 2 3 4 5
- gt seq(10, 20, length6)
- 1 10 12 14 16 18 20
- gt seq(to100, by15, length7)
- 1 10 25 40 55 70 85 100
- gt seq(length10)
- 1 1 2 3 4 5 6 7 8 9 10
40Matrices
- matrix(dataNA, nrowinferred, ncolinferred,
byrowF, dimnamesNULL) - gt X lt- matrix(110, nrow2)
- gt X
- ,1 ,2 ,3 ,4 ,5
- 1, 1 3 5 7 9
- 2, 2 4 6 8 10
- gt dim(X)
- 1 2 5
41Matrix indices
- gt X2 lt- matrix(115, nrow3, byrowT)
- gt X22,3
- 1 8
- gt X212,
- ,1 ,2 ,3 ,4 ,5
- 1, 1 2 3 4 5
- 2, 6 7 8 9 10
- gt X223, 35
- ,1 ,2 ,3
- 1, 8 9 10
- 2, 13 14 15
- gt X2,1
- 1 1 6 11
gt X2,1,dropF ,1 1, 1 2,
6 3, 11
42Transpose and multiply
- Transpose t(X)
- gt X lt- cbind(rep(1,5),15)
- gt t(X)
- ,1 ,2 ,3 ,4 ,5
- 1, 1 1 1 1 1
- 2, 1 2 3 4 5
- Matrix multiply
- gt t(X) X
- ,1 ,2
- 1, 5 15
- 2, 15 55
43Matrix algebra
- Matrix inverse solve
- gt solve(t(X) X)
- ,1 ,2
- 1, 1.1 -0.3
- 2, -0.3 0.1
- Decompositions
- Eigenvector (eigen)
- Singular value (svd)
- Cholesky (chol)
gt eigen(t(X)X) values 1 1.1831
0.0169 vectors ,1 ,2 1,
0.9637 0.2669 2, -0.2669 0.9637
44Lists
- An ordered collection of arbitrary objects
- gt mylist lt- list(dat15, name"short",
- locc(1.0,-2.5))
- gt mylist
- dat
- 1 1 2 3 4 5
- name
- 1 "short"
- loc
- 1 1.0 -2.5
gt mylistname 1 "short" gt mylist3 1 1.0
-2.5 gt names(mylist) 1 "dat" "name" "loc"
45Getting help
- Help on functions
- ?sample
- help(sample)
- General help Help menu
- Function index, keyword search
- On-line manuals (Users/Programmers/Statistics)
- Visual demo
46Stop!
- To cancel the current command, press ESC
- Cancel long outputs
- Abort long computations
- To quit from S-plus (command prompt)
- gt q()
- Or File Exit from menu bar
47Day 1 Session 2
- Practical
- S-plus tutorial
48Day 1 Session 3
- Reading data and data import
- Data frames and factors
- ldframe objects
- adding columns
- sub-selections
- summary
- ssv/tsv
49Reading in data
- File Import Data From File
- Choose file and file type from dialog
50Data import options
- Select the Options tab for import options
- By default, ASCII files are assumed to have
- column labels on row 1 (blank rows ignored)
- data from row 2 onwards
- columns separated by whitespace or comma
- Change defaults with import options
- First data line, separators, column spec
- NB .dat files assumed to be Gauss files!
51Data objects
- When data are imported, the data are displayed in
a data window and a new data object is created - data frame
- The name of the data object is chosen from the
file name, or can be specified in the import
dialog.
52Data Frames
- data.frame
- usually used for data
- Like a matrix with named columns
- columns may be of different types
- rows have labels
- gt catalyst
- Temp Conc Cat Yield
- 1 160 20 A 60
- 2 180 20 A 72
- 3 160 40 A 54
- 4 180 40 A 68
- 5 160 20 B 52
- 6 180 20 B 83
- 7 160 40 B 45
- 8 180 40 B 80
53Data frame matrix?
- You can operate on a data frame as if it were a
matrix - gt catalyst15,-3
- Temp Conc Yield
- 1 160 20 60
- 2 180 20 72
- 3 160 40 54
- 4 180 40 68
- 5 160 20 52
- But numeric functions usually dont work
- gt mean(catalyst,-3)
- Error in as.double Cannot coerce mode list to
- double .Data list(..
- Dumped
- gt mean(
- data.matrix(catalyst,-3))
- 1 22.42
54Data frame list
- A data frame is actually a list whose components
are all of equal length - gt names(catalyst)
- 1 "Temp" "Conc" "Cat" "Yield"
- gt catalystConc
- 1 20 20 40 40 20 20 40 40
- gt catalystnewcol lt- rnorm(8)
55Data frame summary
- gt summary(catalyst)
- Temp Conc Cat
- 1604 204 A4
- 1804 404 B4
- Yield newcol
- Min. 45.0 Min. -0.9880
- 1st Qu.53.5 1st Qu.-0.3780
- Median 64.0 Median -0.1610
- Mean 64.2 Mean -0.0075
- 3rd Qu.74.0 3rd Qu. 0.2890
- Max. 83.0 Max. 1.3700
56Factors
- A factor is a vector whose elements are
categories (levels) - Used for categorical data in modelling
- To check class of object class(object)
- gt f lt- as.factor(str) convert string to factor
- gt str lt- as.character(f) convert factor to
string - gt val lt- as.numeric(as.character(f))
- convert factor with numeric levels to numbers
57Factor example
- gt class(catalystConc)
- 1 "factor"
- gt levels(catalystConc)
- 1 "20" "40"
- gt catalystConc2 lt- 30
- Warning messages
- replacement values not all in levels(x) NA's
- generated in "lt-.factor"(.A0, 2, value .A1)
- gt catalystConc
- 1 2 3 4 5 6 7 8
- 20 NA 40 40 20 20 40 40
58Longitudinal data
- We have
- m subjects, each with a subject id
- subject ids are usually 1, 2, 3, , m
- ni events for each subject i1,,m
- responses, covariates
- each event occurs at a particular time tij
- We will use a rectangular data structure with one
row per event.
59Rat bodyweight data
- gt rats.ldf
- y Time Subject group
- 1 57 1 1 Control
- 2 86 2 1 Control
- 3 114 3 1 Control
- 4 139 4 1 Control
- 5 172 5 1 Control
- 6 60 1 2 Control
- 51 59 1 11 Thyroxin
- 52 85 2 11 Thyroxin
- 134 104 4 27 Thiouracil
- 135 122 5 27 Thiouracil
60ldframe objects
- Like data frames, but required to have these two
columns (at least) - Time the time point of the event.
- index number (1,2,3, )
- days since start of study
- weeks since calving
- Subject the subject id code
- identifies which events belong to which subject.
- all observations on each subject must appear in a
contiguous block
61ldframe object -- data
- Other columns are used for data
- responses
- its best to use the first column for the most
important response - covariates
- constant covariates (e.g. Sex) are simply
repeated for each event within each subject - These columns can have any (legal) name
62Converting data.frame to ldframe
- Convert data frame to ldframe with as.ldframe
- my.ldf lt- as.ldframe(my.df)
- If the data frame does not have Subject and/or
Time columns, specify them here. - names(Orthodont)
- 1 "Subject" "Sex" "age" "distance"
- Orthodont.ldf lt- as.ldframe(Orthodont,Timeage)
63Two-way indexing
- , indexing as for matrix or data.frame
- gt rats.ldf12,
- y Time Subject group
- 1 57 1 1 Control
- 2 86 2 1 Control
- gt mean(rats.ldf,"y")
- 1 100.8
- gt rats.ldf123,"group"
- 1 Thiouracil
- Levels
- 1 "Control" "Thiouracil" "Thyroxin"
64Column operations
- Select columns using list () notation
- gt gp lt- rats.ldfgroup
- gt table(gp)
- Control Thiouracil Thyroxin
- 50 50 35
- Add or delete columns using notation
- gt rats.ldfbodymass lt- newdata
- gt rats.ldfy lt- NULL
65Examples
- First 3 observations from all subjects
- gt rats.ldfrats.ldfTimelt3,
- All from the Thyroxin group
- gt rats.ldfrats.ldfgroupThyroxin,
- Large responses from the Control group
- gt rats.ldfrats.ldfgroup"Control"
- rats.ldfygt170,
- y Time Subject group
- 5 172 5 1 Control
- 10 177 5 2 Control
- 15 185 5 3 Control
66Occurrences -- nths
- nths returns an index vector to nth within-block
occurrences in a block-sorted vector - nths(n, vec)
- vec is the vector to index
- n1 first occurrences
- n13 first three
- n-1 last occurrence
- nc(1,-1) first and last
67Example
- gt vec
- 1 a a a b b b b b c c c c c c d d
- gt nths(1,vec)
- 1 1 4 9 15
- gt nths(13,vec)
- 1 1 2 3 4 5 6 9 10 11 15 16
- gt nths(-1,vec)
- 1 3 8 14 16
- gt nths(c(1,-1),vec)
- 1 1 3 4 8 9 14 15 16
4
9
15
68Practical example
- gt idx lt- nths(c(1,-1),rats.ldfSubject)
- gt rats.ldfidx,
- y Time Subject group
- 1 57 1 1 Control
- 5 172 5 1 Control
- 6 60 1 2 Control
- 10 177 5 2 Control
- 11 52 1 3 Control
- 15 185 5 3 Control
- 16 49 1 4 Control
- 20 164 5 4 Control
- ...
-
69Selecting by subject
- Use a single index to select all events for
selected subject(s) - gt rats.ldf1
- y Time Subject group
- 1 57 1 1 Control
- 2 86 2 1 Control
- 3 114 3 1 Control
- 4 139 4 1 Control
- 5 172 5 1 Control
70More than one subject
- gt rats.ldfc(10,27)
- y Time Subject group
- 46 57 1 10 Control
- 47 82 2 10 Control
- 48 110 3 10 Control
- 49 139 4 10 Control
- 50 169 5 10 Control
- 131 53 1 27 Thiouracil
- 132 72 2 27 Thiouracil
- 133 89 3 27 Thiouracil
- 134 104 4 27 Thiouracil
- 135 122 5 27 Thiouracil
71Selecting by subject id
- To select by subject id, use quotes
- gt rats.ldf.rev lt- rats.ldf271
- gt rats.ldf.rev1,
- y Time Subject group
- 131 53 1 27 Thiouracil
- gt rats.ldf.rev"15"12,
- y Time Subject group
- 71 57 1 15 Thyroxin
- 72 72 2 15 Thyroxin
Note row labels stay with the correct event even
after the data has been reordered.
72Bad ordering
- Selecting so that subjects arent in blocks
anymore - gt rats.perm lt- rats.ldfsample(1135),
- Warning messages
- Selection resulted in invalid ldframe object
- converted to data.frame in ".ldframe\
- "(rats.ldf, sample(1135), )
- gt rats.perm13,
- y Time Subject group
- 30 153 5 6 Control
- 8 123 3 2 Control
- 29 131 4 6 Control
- gt rats.perm lt- rats.ldfsample(127)
73Replication
- Some variables may be subject-specific
- Constant (replicated) within a subject id
- e.g. Sex, Treatment Group
- Some variables may be time-specific
- Constant (replicated) within a time point
- e.g. daily temperature, regression dummy variables
74ldframe summary
- gt summary(rats.ldf)
- Longitudinal data frame (ldframe) object
- Number of subjects 27
- Number of time points 5
- Total events 135
- Distribution of number of observations per
- Subject Time
- 527 275
- Subject-specific variables
- group
- Control 10
- Thiouracil10
- Thyroxin 7
- Time-specific variables none
- Non-replicated variables
- y
- Min. 46
- 1st Qu. 71
- Median 100
- Mean 101
- 3rd Qu.122
- Max. 189
75ssv
- ssv(ldframe)returns all subject-specific columns,
condensed into one-per-subject - gt ssv(rats.ldf)
- Subject group
- 1 1 Control
-
- 17 17 Thyroxin
- 18 18 Thiouracil
- 27 27 Thiouracil
76selective ssv
- To select a specific subject-specific column
- gt gps lt- ssv(rats.ldf,"group")
- gt data.class(gps)
- 1 "data.frame"
- gt table(gpsgroup)
- Control Thiouracil Thyroxin
- 10 10 7
- gt ssv(rats.ldf,"y")
- NULL
- Warning messages
- Variable y exists, but is not
- subject-specific in ssv(rats.ldf, "y")
77Adding columns with ssv
- ssv(obj,"sex") lt- dat
- gt sex lt- rep(c("m","f"),c(10,17))
- gt ssv(rats.ldf,"sex") lt- sex
ldframe object
new col name
vector as long as number of subjects in obj
78ssv results
- gt rats.ldfc(1,27)
- y Time Subject group sex
- 1 57 1 1 Control m
- 2 86 2 1 Control m
- 3 114 3 1 Control m
- 4 139 4 1 Control m
- 5 172 5 1 Control m
- 131 53 1 27 Thiouracil f
- 132 72 2 27 Thiouracil f
- 133 89 3 27 Thiouracil f
- 134 104 4 27 Thiouracil f
- 135 122 5 27 Thiouracil f
79Another example
- gt span lt- function(x) xlength(x)-x1
- gt growth lt- tapply(rats.ldfy, rats.ldfSubject,
- span)
- gt growth
- 1 2 3 4 5 6 7 8 9 10 11 12
- 115 117 133 115 95 107 90 91 91 112 122 84
- 13 14 15 16 17 18 19 20 21 22 23 24 25 26
- 133 118 87 88 119 68 63 80 63 89 68 52 82 61
- 27
- 69
- gt ssv(rats.ldf,"growth") lt- growth
80Time-specific variables
- tsv is like ssv for time-specific variables
- varies within subject
- constant for each time point
- Sometimes data are unbalanced -- each subject
measured at different times - in this case tsv requires a value for each unique
time point
81tsv example
- gt tsv(rats.ldf, "temp") lt- c(21,20,25,28,35)
- gt rats.ldf1011
- y Time Subject group growth temp
- 46 57 1 10 Control 112 21
- 47 82 2 10 Control 112 20
- 48 110 3 10 Control 112 25
- 49 139 4 10 Control 112 28
- 50 169 5 10 Control 112 35
- 51 59 1 11 Thyroxin 122 21
- 52 85 2 11 Thyroxin 122 20
- 53 121 3 11 Thyroxin 122 25
- 54 146 4 11 Thyroxin 122 28
- 55 181 5 11 Thyroxin 122 35
82Help Files
- data.frame.object
- read.table
- ldframe
- as.ldframe
- ldframe.object
- nths
- summary.ldframe
- ssv
- tsv
83Day 1 Session 4
- Practical
- Read in data
- get summaries
- create subsets of data
- using tsv/ssv
- complex subsetting -- nths
84Day 2
- Exploring longitudinal data
85Day 2 Session 1
- Plotting
- Smoothing
- Cross-validation
- Object-orientation
86Plotting longitudinal data
- Goals
- Show how response varies with time
- Show variation within subject across time
- Show variation between subjects
- Show variation between groups (covariates)
- Highlight unusual observations missing data
- Problem too much data!
87Scatterplot?
- Scatterplot of yij versus tij
88Basic parallel plot
- Create a scatterplot of yij versus tij
- Connect observations within each subject i with
lines - "traces"
- Shows overall variation
- Allows observing within-subject behaviour by
following lines
89Basic plot
- The standard S-plus plot function creates a
parallel plot for ldframe objects - gt plot(rats.ldf)
- NB plots created from the command line cannot
(easily) be modified using the GUI
90Milk Protein Data
- 79 cows
- Three treatment levels (group)
- Barley diet
- Lupins diet
- Mixed (mixture of barley and lupins)
- Response y milk protein content
- Time is weeks since calving (up to 19)
- Missing data
91milk.ldf
- gt idx lt- nths(c(12,-(12)),milk.ldfSubject)
- gt milk.ldfidx,c(78,4)
- y Time Subject group
- 1464 3.67 1 78 Lupins
- 1465 3.26 2 78 Lupins
- 1481 3.35 18 78 Lupins
- 1482 3.09 19 78 Lupins
- 58 3.66 1 4 Barley
- 59 3.50 2 4 Barley
- 75 3.28 18 4 Barley
- 76 NA 19 4 Barley
- gt plot(milk.ldf)
92Highlighting by group
- Subject-specific columns can be used to
differentiate subjects by colour, line style, or
plotting character. - gt plot(milk.ldf, colorgroup)
- gt plot(milk.ldf, linegroup)
- gt plot(milk.ldf, pchargroup,
- generallist(type"b"))
- Combine color line pchar for 2 or 3-way
discrimination.
93Milk data plot
- gt plot(milk.ldf, colorgroup)
- Colours chosen according to levels of group
- Missing data highlighted
- Dashed lines for intermittent missing values
- Red crosses for dropouts
94Adding a legend
- Use the legend argument to add a legend to the
plot - gt plot(milk.ldf, colorgroup, legendlocator(1))
- The legend is created automatically using
information from the plot specification.
95Shadow plots
- The previous plot is very "busy"
- difficult to follow individual trails
- difficult to distinguish between-group behaviour
- Idea plot most traces as light "shadows" to
reduce clutter but maintain impression of
variation.
96Shadow plot
- The n.full argument specifies the number of
traces to plot in full - chosen at random
- the remainder as shadows
- use number between 0 and 1 for percentage
- plot(milk.ldf, colorgroup,
- n.full0.15)
97Selector function
- When n.full is used, the full traces are selected
at random. - Alternative select an "even spread" according to
some statistic of the trace - The selector argument may be used to specify a
function to calculate the statistic - plot(milk.ldf, n.full12, colorgroup,
- selectorfunction(y) y1)
98Selector notes
- The traces with the minimum and maximum
statistics are always chosen - If the selector function returns NA, that trace
will never be chosen - So be careful when using functions like mean as
the selector function - If there are ties in the statistics, traces are
randomly selected within ties
99Example selector functions
- selector
- function(y) mean(y, na.rmT)
- function(y) max(y, na.rmT)
- function(y) diff(range(y, na.rmT))
- function(y) sum(!is.na(y))
- function(y) rev(y!is.na(y))1
100Overlaying plots
- When used with ldframe objects, lines and points
add to the current plot - lines -- parallel plot
- points -- scatterplot
- plot(milk.ldf, n.full0)
- lines(milk.ldfc(1,27,76),
- colorgroup)
101Different response
- By default, plot.ldframe uses the first column of
the ldframe as the response - To use a different response, supply an expression
as the second argument - You can use any of the column names in the
expression - plot(rats.ldf, log(y), colorgroup)
- plot(my.ldf, var2)
102Different x variable
- By default, plot.ldframe uses the Time column as
the x variable - To use a different x variable supply an
expression for argument 3 (x.expr) - plot(rats.ldf, x.exprlog(Time))
- plot(cog.ldf, x.exprAge)
- plot(mydat, x.exprjitter(Time))
103Customization
- There are many optional arguments to plot.ldframe
to change the appearance of the plot. - These can be included in the call to plot, or use
tsopts to make permanent changes - See ?tsopts for a complete list
- plot(milk.ldf, colorgroup, colsc(8,5,6))
- tsopts(colsc(8,5,6))
104general
- A list of options to be passed to the standard
plot function. - generallist(xlab"Hours")
- generallist(type"o", cex2)
- generallist(main"Rat Data")
- gt plot(milk.ldf, generallist(type"o", pch""))
- See ?plot, ?par
105full
- A list of graphics parameters to use when
plotting traces in full (see ?par) - fulllist(lwd2)
- fulllist(lty3, pch"o")
- fulllist(col6)
- gt plot(rats.ldf, fulllist(lwd2, col3))
- Also shadows for shadow traces.
106cols, pchs, ltys
- Used when the color, line, or pchar arguments are
used in plot - cols colours to cycle through
- pchs plotting characters
- ltys line types
- tsopts(pchs115)
- plot(rats.ldf, pchargroup,
- generallist(type"b"), legendlocator(1))
107Highlighting missing values
- miss.style graphics parameters to use when
drawing a line between two points in a trace
separated by missing value(s) - dropout.style graphics parameters to use when
plotting the last point in a trace terminated by
missing value(s). Also dropin.style - tsopts(dropout.styleNULL)
108Mean vs time
- How best to observe changes in sample mean over
time? - Possibly within several treatment groups
- Within-time-point means?
- What if data are unbalanced?
- What if there is too much variability?
- What about correlation?
109Kernel Smoothing
- Assume a single group for the moment.
- Basic idea estimate mean at time t using a
weighted average of points on all subjects
observed near t - Bigger weights nearer to t
- Weights defined by a kernel function K(t)
110Kernel function
K(t)
h
t
- The "width" of the kernel function is specified
by a bandwidth h
111Kernel estimation
- Model yi(t) µ(t) ?i(t)
- Data (yij, tij)
- weight for yij w(t, tij, h) K((t-tij)/h)
112ksmooth.lda
- Simplest specify ldframe and bandwidth
- sm1 lt- ksmooth.lda(milk.ldf, band3.5)
- plot(sm1)
- Uses a Normal kernel by default
- Returns an ldframe with a single subject
- smooth calculated at all unique time points by
default
113The bandwidth
- If the bandwidth is large, the kernel estimate
will be very smooth - A small bandwidth will reflect local behaviour,
and be "rougher" - A zero bandwidth will calculate the average of
any points which exists at the time point.
114Bandwidth example
- Pass a vector to bandwidth to calculate many
smooths at once. - sm2 lt- ksmooth.lda(milk.ldf,
- bandc(0,2,3.5, 10, 100))
- plot(sm2, Smooth1, fulllist(col1))
- lines(sm2, Smooth2, fulllist(col2))
- lines(sm2, Smooth3, fulllist(col3))
- lines(sm2, Smooth4, fulllist(col4))
- lines(sm2, Smooth5, fulllist(col5))
115More points
- The kernel smooth can be calculated at any time
point - Specify time points to calculate at with the
x.points argument - sm3 lt-ksmooth.lda(milk.ldf, band3.5,
- x.pointsseq(-1,22,length1000))
- plot(sm3)
116Smoothing within groups
- Often we use smoothing to look at differences
between treatment groups - Specify the groups with the by argument
- A factor variable classifying the data points
into subgroups (eg treatment group) - The result ldframe will have as many traces as
there are levels in by
117Within-groups example
- Calculate smooths within each of the three diets
for the milk protein data, and plot the smooths
with the raw data. - sm4 lt- ksmooth.lda(milk.ldf,
- band3.0, bygroup)
- plot(milk.ldf, n.full0)
- lines(sm4, colorSubject,
- legendlocator(1))
118Changing the kernel function
- You can specify any kernel function with the
kernel argument - A scalar valued function of one argument
- Default kernels kernel.knormal,
ksmooth.kbox,ksmooth.ktriangle. - All have quartiles at 0.25
- The choice of kernel function isn't nearly as
important as the choice of bandwidth
119Kernel example
- Create a new kernel function and plot it
- mykernel lt- function(u)
- ifelse(abs(u)lt1, 1, 1/abs(u))
- x lt- seq(-2,2,length100)
- plot(x,mykernel(x),type"l")
- Use it on the milk protein data
- plot(ksmooth.lda(milk.ldf, bygroup,
- band0.5, kernelmykernel))
120Selecting the bandwidth
- The Rice-Silverman Criterion
- Set bandwith at h
- Remove subject i from the data
- Calculate the kernel smooth
- Add up the squared differences between subject i
and the kernel smooth - Sum over all subjects
- Minimise over h to determine bandwidth
121ksmooth.rsc
- ksmooth.rsc works like ksmooth.lda, but returns
the criterion value instead - gt ksmooth.rsc(milk.ldf, band1.5)
- bandwidth RSC
- 1, 1.5 132.3614
- gt rsc lt- ksmooth.rsc(milk.ldf,
- bandseq(0,4,length30))
- gt plot(rsc)
122Grouped RSC
- If you use the by argument with ksmooth.rsc, you
get the sum of the criterion values within the
subgroups. - rsc lt- ksmooth.rsc(milk.ldf, bygroup,
- bandseq(0,2,length30))
- plot(rsc)
123For more information
- ?plot.ldframe
- ?tsopts
- ?ksmooth.lda
- ?ksmooth.rsc
- Why plot.ldframe?
124Object-oriented
- S-plus is object oriented
- Some objects have a class which identifies the
type of object. - gt class(milk.ldf)
- 1 "ldframe" "data.frame"
- milk.ldf is of class ldframe, which inherits from
class data.frame
125Method functions
- Some "generic" functions behave differently
depending on the class of the first argument - plot, summary, print, operators,
- You call plot(milk.ldf)
- S runs plot.ldframe(milk.ldf)
126Day 2 Session 2
- Practical -- plotting and smoothing
127Day 2 Session 3
- Model specification
- OLS models -- lm
- Missing data issues -- na.add
128Linear Models
129Models in S-plus
- formula notation
- response predictor predictor
- response vector
- predictors vector, factor
- Intercept term (1) automatically included unless
specifically excluded
130Model formulae
131Y 1
Response
Time
132Y Time
Response
Time
133Y Time Time2
Response
Time
134Y poly(Time,3)
Response
Time
135Y group Time
Response
Time
136Y groupTime
Response
Time
137More formula examples
- y group - 1
- y group - 1 Time
- log(y) sex group/Time
- log(y) sex group/Time - group
- y poly(Time,3) sexgroup
- y group/ns(Time,4)
- (4 df natural splines within each group -- see
also bs)
138Factor contrasts
- By default, S-PLUS uses Helmert contrasts
- options(
- contrastsc("contr.treatment","contr.poly"))
- Intercept -- level 1 estimate
- Factor g estimate -- Level g - Level 1
139Ordinary Least Squares
- Although we know there is within-subject
correlation, sometimes useful to ignore it - The clever ostrich
- Point estimates of parameters consistent and
usually fairly efficient - Standard errors can be very wrong (small)
- but possible to get better estimates
140Fitting OLS models lm
- Basic usage
- obj lt- lm(formula, datadataframe)
- formula is a model formula
- dataframe is the name of a data frame in which to
find the formula variables - obj is an object to store the results
141Model results
- You can print or extract various components of
the result object - obj brief details
- summary(obj) more details
- coef(obj) estimated coefficients
- anova(obj) AOV table
- fitted(obj) fitted values
- resid(obj) residuals
- plot(obj) diagnostic plot
142lm example
- fit lt- lm(log(y) group
- poly(Time,2), datarats.ldf)
- fit
- coef(fit)
- summary(fit)
- anova(fit)
- plot(rats.ldf, fitted(fit))
- plot(rats.ldf, resid(fit))
- plot(fit)
143Model comparison
- You can compare two model objects using
anova(obj1,obj2) - for lm this gives an AOV table with an F test for
each term (not parameter) - fit2 lt- lm(log(y)
- group poly(Time,3), datarats.ldf)
- anova(fit1,fit2)
144Prediction
- predict(obj) return fitted values
- predict(obj,newdata) make predictions
- newdata is a data frame with new covariate values
to predict at - WARNING predict gives wrong answers when the
parameterisation depends on the covariate values
(eg poly) - use predict.gam instead (safe prediction)
145predict example
- newdata lt- ldframe(Subjectrep(13,rep(100,3)))
- tsv(newdata,"T2") lt- seq(1,6,length100)
- newdataTime lt- newdataT2
- ssv(newdata,"group") lt- levels(rats.ldfgroup)
- plot(rats.ldf, log(y),
- generallist(xlimc(1,6)))
- lines(newdata, predict.gam(fit2,newdata),
- colorSubject)
- lines(newdata, predict(fit2,newdata),
- colorSubject) wrong!!
146coef example
- Lets look at variation at residual slope and
intercept across subjects - resfit lt- lm(resid(fit)
- Subject/Time - 1, datarats.ldf)
- co lt- coef(resfit)
- hist(co127)
- hist(co-(127))
147Missing data
- If the response or any covariates contain NAs, lm
will fail with an error message - gt lm(y group, datamilk.ldf)
- Error in function(object, ...) missing values
not allowed found in y - Dumped
- Include the option na.actionna.omit to delete
all rows with missing values before doing
calculations
148na.omit example
- gt fit3lt-lm(ygroup, datamilk.ldf,
na.actionna.omit) - gt fit3
- Call
- lm(formula y group, data milk.ldf,
na.action na.omit) - Coefficients
- (Intercept) group1 group2
- 3.42467 -0.1097727 0.002512743
- Degrees of freedom 1337 total 1334 residual
- Dropped 164 cases due to missing values
- Residual standard error 0.3198013
149na.omit a problem
- na.omit works by temporarily deleting rows with
missing data from the data frame - lm thinks there are fewer rows than there are
- resid and fitted return vectors as long as the
number of non-missing rows - gt plot(milk.ldfy, resid(fit3))
- Error in plot Lengths of x and y must match
- gt length(milk.ldfy)
- 1 1501
- gt length(resid(fit3))
- 1 1337
150na.add a solution
- You can use na.add to add a shortened vector to
an ldframe, expanding it by inserting missing
values where needed. - na.add(milk.ldf, resid) lt- resid(fit3)
- milk.ldf41719,
- y Time Subject group resid
- 74 3.35 17 4 Barley -0.1819294
- 75 3.28 18 4 Barley -0.2519294
- 76 NA 19 4 Barley NA
151na.add example
- Milk data 4df splines within each treatment
group. - a lt-lm(y group/ns(Time,4), datamilk.ldf,
na.actionna.omit) - af lt- fitted(a)
- na.add(milk.ldf, "spline") lt- af
- plot(milk.ldf, spline, colorgroup)
152na.omit an alternative
- Alternatively, you can simply use na.omit to
remove all rows with missing data from your data
before you begin. - milk.nona.ldf lt- na.omit(milk.ldf)
- Not always appropriate
- Oswald functions work best with missing data left
in.
153Day 2 Session 4
- Practical
- Fitting models
- Getting coefficients
- Making predictions
- Robust SEs
154Day 3
- Analysing Longitudinal Data
155Day 3 Session 1
- Variogram
- Parametric correlation models
- Likelihood ratio testing
- Structured random effects
156The variogram
- Assume our data are replications of a stationary
process Y(t) - The variogram is used to explore the correlation
structure
157Variogram
- When u is small
- ?(u)gt0 -- Measurement error
- When u is large
- if ?(u)ltlt?2, no independence even at large lags
- random intercept
158Creating stationarity
- Before we can calculate the variogram, need to
de-trend the data - Calculate the most complex model imaginable, and
take residuals - For balanced data, residuals from saturated
treatmenttime model - For unbalanced data, residuals from kernel smooth
by treatment group
159Stationarity -- balanced data
- Kernel smoothing with zero bandwidth calculates
means within group at each time - ksmooth.res returns residuals
- res lt- ksmooth.res(milk.ldf, bygroup, band0)
- plot(res, n.full10, colorssv(milk.ldf)group)
- Also works with "slightly" unbalanced data
160Stationarity -- unbalanced data
- Choose the most complex classification
- Choose bandwidth with RSC
- Kernel smooth and take residuals with ksmooth.res
161Sample variogram
- Take stationary residuals rij
- Calculate vijk½(rij- rik)2, uijktij- tik
- Variogram estimate v(u) at lag u is average of
all vijk where uijk is near u - "Near" means within a binwidth eps
- Also estimate process variance ?2
- average ½(rij- rlk)2 for i? l
162Calculating the variogram
- res lt- ksmooth.res(milk.ldf,
- bygroup, band0)
- vg lt- variogram(res)
- plot(vg)
- The variogram is calculated at all unique lags at
least 1 binwidth apart - Change the binwidth with the eps argument
163Variogram cloud
- You can also calculate and plot all the vijk's by
including the cloudT argument - vg2 lt- variogram(res, cloudT)
- plot(vg2)
- The plot includes the cloud as a scatterplot
- Warning there can be many cloud points, which is
why the cloud is not returned by default
164Variogram interpretation
- Intercept
- measurement error
- Process variance - stable value at large lags
- random intercept
- if it doesnt stabilise, not stationary!
- Rate of increase to stable value
- serial correlation
165Mixed Effects Modelling
- Sometimes, it makes sense to include random
coefficients in our model - random intercept
- random slope
- This is useful in many ways
- more natural models
- more efficient parameterisation
- explains correlation
166Mixed effects examples
- Random intercept
- Yij (? Ui) ?tij ?ij
- Random slope
- Yij (? Ui) (? Si) tij ?ij
- Multilevel effects
- Yij (? Ui) (? Si) tij Gg ?ij
-- no can do!
167Structured random effects
- Y X?Z? ?
- Random effects ? Var(?)G
- Serial correlation ? Var(?)R
- Random effects zero-mean, Normal
- include fixed effects for nonzero mean
- ? realised independently for each subject
- 2-level model (multilevel gt 2 not possible)
168Basic lme usage
- How to specify, Z, G, R?
- lme(fixed, random, cluster, data)
- fixed linear model (X)
- random RE model (Z)
- cluster independent units
- usually cluster Subject
- data ldframe (or dataframe)
- Other options for specifying G, R, ML/REML,
(see ?lme.formula)
169Random effect design
- Specify random model formula for Z
- random 1 random intercept
- random Time random slope intercept
- NB no LHS to formula!
- By default, all fixed components included in
random model. - All random effects assumed correlated
- general (unstructured) form for G
170lme -- rats
- NB lme does not allow within-formula
transformations - rats.ldflogy lt- log(rats.ldfy)
- Parallel quadratics, random intercept
- lme(logy group poly(Time,2), 1,
- clusterSubject, datarats.ldf)
171lme -- orthodontic data
- Orthodont.plot()
- orth1 lt- lme(fixed distance age,
- random age, cluster Subject,
- data Orthodont)
- orth1
- summary(orth1)
172Predictions
- Population predictions
- assuming ?0 (i.e. EY)
- Cluster predictions
- EYi Zmode(?ik Y)
- predict(obj, newdata, cluster)
- predict(orth1, Orthodont, Subject)
173Diagnostic plots
- lme provides 3 types of diagnostic plot
- scatterplot matrix of predicted REs
- plot(orth1,optionc)
- Residual plots
- plot(orth1, optionr)
- response vs fitted values
- residuals vs fitted values
- residuals by cluster
- Standardised residual plots
- plot(orth1, options)
174RE covariance structures
- By default, all between-random-effect covariances
are estimated independently - unstructured G
- With many random effects, a simpler structure may
be preferred.
175re.structure
- re.structureunstructured
- independently estimate all correlations (default)
- re.structurecompsymm (compound symmetry)
- all variances and covariances equal
- re.structurediagonal
- independent, different variances
- re.structureidentity
- independent, common variance
- re.structurear1
- common variance, AR(1) correlation
176More complex structures
- re.block can be used to group random effects into
independent blocks - re.block list(c(1,2),3)
- re.structure is given as a vector for the
strucure of each block - re.structurec(compsymm,identity)
- Or, you can design your own structures by writing
an S function
177Serial correlation
- Structure for Var(?i) (and therefore R)
- serial.structure
- identity (constant variance, default)
- ar1 (AR(1), unit spacing)
- compsymm
- ar2, ma1, ma2, arma11
- DIY
- Dont overparameterise!!
178Balanced or unbalanced?
- Random effects model
- no problem here
- Serial correlation model
- identity, compsymm OK
- ar1 etc. assume unit time spacing by default
- lme(, serial.structurear1,
- serial.covariate Time, serial.covariate.trans
formationnone)
179Diggle 3-component
- Measurement error variance ?2
- Random intercept variance ?2
- Random process Wi(t) SGP(0, ?2?(?t ))
- ?(?t ) ??t, correlation function
- continuous-time AR(1)
180Diggle 3-component example
- Fit D3C model (almost) to rats data
- lme(logy group poly(Time,2),
- random 1, cluster Subject,
- serial.structure"ar1", datarats.ldf)
- Compare variance estimates with variogram
- plot(variogram(ksmooth.res(rats.ldf,
- log(y), bygroup, band0)))
181REML or ML
- ML estimation
- est.methodML
- REML estimation (default)
- est.methodRML
- Model comparison (LR test) with REML
- only when fixed models are the same!
182Model testing
- Use anova to compare models
- obj1 lt- lme(logy grouppoly(Time,2),
- 1, Subject, datarats.ldf)
- obj2 lt- lme(logy grouppoly(Time,2),
- 1Time, Subject, re.struct"diag",
- datarats.ldf)
- anova(obj1,obj2)
183Simplifying the variance model
- If you give a variance parameter a zero initial
estimate, it will be fixed at zero - Useful for testing whether a certain variance
component is necessary in the model - Use a Likelihood Ratio test
- Use CS with covariance0 for ID
184Parameter estimation a warning
- It can be difficult to get reliable variance
parameter estimates from mixed models - flat likelihoods (and no standard errors!)
- correlated parameters (and no est. correlations!)
- easily overparameterised (with no warning!)
- Restart your model with different initial
estimates to check convergence - start list(D matrix, alpha c())
185Day 3 Session 2
- Practical
- variograms
- Parametric correlation models
- Random effect models
186Day 3 Session 3
- Other topics
- NLME
- GEE
- dropout modelling
- alternating logistic regressions
- resources
- general discussion
187Nonlinear mixed effects
- nlme Allows a nonlinear model for with random
effects - gt Orange.plot()
- gt Orange.fit lt- nlme(
- circumference A/(1 exp((B - age)/C)),
- fixed list(A ., B ., C .),
- random list(A ., B .),
- cluster Tree, data Orange,
- start list(fixed c(160, 700, 350)),
- serial.structure "ar2",
- var.function "expon")
188Pharmokinetics example
- one-compartment open model with first-order
absorption and elimination - clearance ?1i, absorption ?2i, elimination ?3i
- randomly varying per patient i
- Theoph.plot()
189GEE
- Generalised Estimating Equations
- Simplest case
- EY X?
- VarY not specified
- working correlation matrix R
190GEE for non-normal data
- For non-Normal data
- h(EY) X?
- Link function h determined by dist. family
- familygaussian identity
- familybinomial logit
- familygamma inverse
- familypoisson log
- familyquasi(link, variance)