Title: MLwiN: a few pointers
1MLwiN: a few pointers
In this session we give a little guidance on how
MLwiN handles some generic tasks: data import,
data manipulation and graphing.
Using MLwiN for statistical modelling is not the
topic of this session; that will be covered in
later practical sessions.
2Data import
The latest version of MLwiN can directly import
and export STATA, MINITAB and SPSS files. These
options are located on the File menu (Open and
Save worksheet).
Sometimes it is convenient to paste data in from
the clipboard, see
http://www.cmm.bristol.ac.uk/MLwiN/tech-support/support-faqs/data-in/index.shtml
TIP: don't import vast numbers of variables.
Concentrate on the ones you need. Usually
somewhere between 5 and 50 is appropriate.
3Commands and GUI
MLwiN has a Graphical User Interface (GUI) and a
scripting language.
In this workshop we will use the GUI almost
exclusively. However, some enthusiasts may want
to learn the scripting language. It is
documented in the Help system. A slightly
out-of-date command manual can be downloaded from
http://www.cmm.bristol.ac.uk/MLwiN/download/manuals.shtml
Many people use the package of their choice for
data management and then import to MLwiN for
multilevel modelling and visualisations.
Other people use MLwiN for data management and
preparation as well as multilevel analysis.
4Opening the software
Menus with options for data manipulation,
graphics and modelling
Options for controlling model estimation
Work area where windows for specific tasks appear
Progress reporting
5Opening a worksheet
Variable summary window called the Names window
always appears
6Names window-one row per variable
Toggle selected variable categorical/numeric
View data for selected variables
View/edit category names
Naming/renaming variables
Enter additional descriptive text for variable
View only columns containing data
Defining a variable as categorical or numeric
affects how the variable is treated when entered
into a model
7Data Manipulation Menu
Many common data manipulation, viewing and
editing tasks can be carried out in windows opened
from the Data Manipulation menu, for example by
selecting Data Manipulation → View or Edit Data
The data window, with one row per case
Note: these data are arranged as a flat file, with
values for school-level variables repeated
But be careful
8MLwiN is not a strict record based system
MLwiN has, by default, 1500 columns that can
contain variables. This can be increased by
selecting Worksheet from the Options menu.
MLwiN is not a record-based system: it will allow
variables to be of different lengths in a
worksheet. It is like a spreadsheet in this
respect. Be careful: variables can get out of
alignment.
9Getting things out of alignment
For example, if we sort the variable Normexam on
its own, it is out of alignment with the other
variables
Clicking the undo button restores the data
10Carrying: re-ordering and selection operations
When we re-order a variable or change the length
of a variable, we must explicitly carry all
other variables that are linked horizontally in
the same record.
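The difference between sorting a column alone and sorting it while carrying a linked column can be mimicked outside MLwiN. The sketch below is plain Python, not MLwiN's scripting language; the values and the `school` column are made up for illustration.

```python
# Hypothetical worksheet: two columns of equal length, one row per pupil.
normexam = [0.26, -1.51, 0.13, 1.20]
school = [1, 1, 2, 2]

# Sorting normexam alone: school keeps its old order, so rows no longer match.
sorted_alone = sorted(normexam)

# Sorting while "carrying" school: sort the row indices, then reorder every column.
order = sorted(range(len(normexam)), key=lambda i: normexam[i])
normexam_sorted = [normexam[i] for i in order]
school_carried = [school[i] for i in order]

rows_before = set(zip(normexam, school))
rows_misaligned = set(zip(sorted_alone, school))
rows_carried = set(zip(normexam_sorted, school_carried))

print(rows_carried == rows_before)     # carried sort keeps each record together
print(rows_misaligned == rows_before)  # sorting one column alone breaks records
```

Reordering by a shared index is the same idea as listing every linked column in the "carried" panel of the sort window.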
The sort window has a similar structure to many
of the data manipulation windows in MLwiN. So
carrying with the sort window, to ensure
everything stays lined up, looks like this
11Carrying..
Left panel defines operation
Variables to be carried
Destination columns, same as input or new
Buttons for removing particular actions, removing
all actions, executing the actions, or undoing
the actions
Right panel lists actions requested
NOTE: UNDO is only available while the data
manipulation window for the task at hand (in
this case Sort) is open, so inspect the data
before closing the particular data manipulation
window you are using.
12Missing Data
If you have a single value coding for missing
data, you can set this via Options → Numbers
If you are importing a STATA, SPSS or MINITAB
worksheet, MLwiN should recognise the system
missing codes from these software packages
If you paste data in and a variable has a unique
non-numeric code sequence such as . or ???, the
unique non-numeric code will be interpreted as
missing.
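The rule for pasted data can be pictured with a small sketch. This is an illustration in Python of the general idea only, not MLwiN's actual implementation; `parse_column` is a hypothetical helper and the pasted values are made up.

```python
import math

def parse_column(tokens):
    """Convert pasted text tokens to floats; non-numeric codes become NaN (missing)."""
    values = []
    for token in tokens:
        try:
            values.append(float(token))
        except ValueError:
            values.append(math.nan)
    return values

# Made-up pasted column with two non-numeric missing codes.
pasted = ["0.26", ".", "1.20", "???"]
column = parse_column(pasted)
n_missing = sum(math.isnan(v) for v in column)
print(column, n_missing)
```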
13Graphics
MLwiN has a range of graphical functionality,
including various types of static and
interactive visualisation to aid model
interpretation and exploration. For the moment we
review some of the standard graphics available
via the Graphs → Customised Graphs window
14Three layers of graphics output available
They can be specified in the Customised Graphs
window. Firstly, a display. A display can contain
multiple graphs, up to 25. Only one display can
be viewed at a time, but you can have up to ten
displays and switch between them.
Secondly, a graph. There are two in this display.
Finally, a graph can contain multiple data sets.
A data set is a single variable (for histograms)
or an x-y pair of variables (for other plots). The
left graph is a histogram of the variable
normexam. The right graph contains two data sets:
y = normexam, x = standlrt (in blue) and
y = normexam, x = gender (in red).
15Filling out the customised graph window
16Save your worksheet
If you save a worksheet, it saves the data and
any current graphs and the current statistical
model and its results and any tables of results
from multiple models you have built.
Save your worksheet regularly. Regard anything
unsaved as a hostage to fortune: either a system
crash or a user mistake can leave data
irreversibly rearranged, e.g. by deleting
columns or getting variables out of alignment.
17Multiple regression: a refresher
In this and other sessions we will be using data
from the 2002 European Social Survey (ESS).
Measures of ten human values have
been constructed for 20 countries in the European
Union.
We will study one of the ten values, hedonism,
defined as pleasure and sensuous
gratification for oneself.
The scores on the hedonism variable range from
-3.76 to 2.90, where higher scores indicate more
hedonistic beliefs.
In this session we consider the application of
multiple regression to a subset of the data for
three countries only: UK, Germany and France.
Hedonism is taken as the outcome variable in our
analysis. We consider three explanatory
variables:
Age in years
Gender (coded 0 for male and 1 for female)
Country (coded 1 for the UK, 2 for Germany and
3 for France)
18Regression with a single continuous explanatory
variable
Line of best fit through the data:
$y_i = \beta_0 + \beta_1 x_i + e_i$
Ordinary least squares estimates $\beta_0$ and
$\beta_1$ so as to minimise the sum of the
squared values of $e_i$
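For a single predictor the least squares estimates have a closed form. The following is a minimal sketch in Python, not the routine MLwiN uses; `ols_fit` is a hypothetical helper and the data are made up so that the answer is known in advance.

```python
def ols_fit(x, y):
    """Closed-form least squares estimates (b0, b1) for y = b0 + b1*x + e."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Made-up data lying exactly on y = 1 + 2x, so the fit should recover (1, 2).
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
b0, b1 = ols_fit(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(b0, b1)
```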
19Terminology
Y: response variable, outcome variable,
dependent variable
X: explanatory variable, predictor variable,
independent variable
20Linear regression with a continuous predictor
Research questions
Is there an association between y and x?
For example, in the values data set, is there an
association between hedonism (y) and age (x)?
21Interpretation
For every one-year increase in age, hedonism
decreases by 0.018 units.
At age = 0 (x = 0) the predicted average hedonism
level is 0.712. The notion of the hedonism score
of a newborn baby, where hedonism is measured by
answers to survey questions put to people aged 14
to 98 years, is not very meaningful.
22Centering
When an x value of 0 is outside the range of x,
and the interpretation of the intercept is
therefore not meaningful, people often center the
x variable. In our data set we can center age
around its average value of 46 years. The
intercept then becomes the predicted hedonism
score at the average age of 46, and the slope
estimate stays at -0.018.
Note that centering a predictor variable does not
change the estimate of the slope or the position
of the regression line through the data
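The effect of centering can be checked by hand from the rounded estimates quoted earlier (intercept 0.712, slope -0.018). The sketch below is illustrative Python; because it starts from rounded values, the centered intercept it gives (about -0.116) may differ in the last decimal place from what the software reports.

```python
# Rounded estimates quoted in the slides for the uncentered model.
b0, b1 = 0.712, -0.018
mean_age = 46

# With x* = age - 46 the same line is y = (b0 + b1 * 46) + b1 * x*.
b0_centered = b0 + b1 * mean_age
b1_centered = b1  # the slope is unchanged by centering

def predict_raw(age):
    return b0 + b1 * age

def predict_centered(age):
    return b0_centered + b1_centered * (age - mean_age)

print(round(b0_centered, 3))
# Both parameterisations give the same prediction at any age.
print(abs(predict_raw(30) - predict_centered(30)) < 1e-12)
```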
23Linear regression with a continuous explanatory
variable: assumptions
1. Independence. The residuals ($e_i$) are assumed
to be independent of each other. This means that
knowing the value of the residual for one person
tells us nothing about the value of a residual
for any other person. The residuals are also
assumed to be independent of x, that is
$\mathrm{cov}(x_i, e_i) = 0$.
2. The residuals follow a Normal distribution,
that is $e_i \sim N(0, \sigma_e^2)$.
3. The variance of the residuals is constant with
respect to x. This is known as homoskedasticity.
24Constant variance assumption
Residual variance constant with respect to x:
homoskedasticity
Residual variance not constant with respect to x:
heteroskedasticity
25Checking the model assumptions
We can evaluate the validity of assumptions 2 and
3 by use of diagnostic plots.
Assumption 2: Normality. Standardised residuals
plotted against their Normal scores should lie on
a straight line.
Assumption 3: Constant variance. The vertical
scatter of points should be roughly the same for
any value of x.
Assumption 1: If we suspect residuals are not
independent of each other then we can fit more
complex models to test this, for example a
multilevel model.
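The points for a Normal plot can be computed by pairing ranked standardised residuals with expected Normal quantiles. The sketch below is illustrative Python with made-up residuals; the plotting position (i - 0.5)/n is one common convention and is an assumption here, since MLwiN may use a different one.

```python
from statistics import NormalDist, mean, pstdev

def normal_scores(n):
    """Expected Normal quantiles for ranks 1..n, using the (i - 0.5)/n position."""
    return [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

def standardise(residuals):
    """Subtract the mean and divide by the standard deviation."""
    m, s = mean(residuals), pstdev(residuals)
    return [(r - m) / s for r in residuals]

# Made-up residuals purely for illustration.
residuals = [-1.2, 0.4, 0.1, -0.3, 1.5, -0.5]
points = list(zip(normal_scores(len(residuals)), sorted(standardise(residuals))))
for score, resid in points:
    print(round(score, 2), round(resid, 2))
```

If the residuals are roughly Normal, plotting the second value against the first gives points close to a straight line.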
26Hypothesis testing: p values
Null hypothesis: there is no relationship
between hedonism and age in the population
($\beta_1 = 0$) and the relationship we observe in
the sample could have arisen by chance.
Alternative hypothesis: there is a relationship
in the population ($\beta_1 \neq 0$).
The standard error is a measure of the
imprecision of our estimates (as the standard
error gets smaller the precision of our estimates
increases). In our example
$SE(\hat\beta_1) = 0.001$. We can look at the Z or
t ratio,
$\hat\beta_1 / SE(\hat\beta_1) = -0.018 / 0.001 = -18$,
which yields a p-value < 0.001. This says that if
there were no relationship in the population
between hedonism and age, we would expect less
than 0.1% of samples to produce a slope estimate
of magnitude greater than 0.018.
Note that the SE decreases with n, so that with
large enough samples any non-zero effect, however
small, becomes statistically significant.
27Hypothesis testing: confidence intervals
Alternatively, but equivalently, we can construct
a 95% CI for $\beta_1$:
$\hat\beta_1 \pm 1.96 \times SE(\hat\beta_1) = -0.018 \pm 1.96 \times 0.001$,
which is approximately (-0.020, -0.016).
Zero (the value of $\beta_1$ under the null
hypothesis) is well outside the 95% confidence
interval, so we reject the null hypothesis and
conclude that the relationship is statistically
significant at the 5% level. Note that -1.96 and
1.96 are the 2.5% and 97.5% points of a standard
Normal distribution.
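The z ratio, p-value and confidence interval can be reproduced from the rounded slope and standard error quoted in the slides. This is a sketch in Python using a Normal approximation; it is not output from MLwiN, and the unrounded estimates would give slightly different figures.

```python
from statistics import NormalDist

b1, se = -0.018, 0.001  # rounded slope and standard error from the slides

z = b1 / se
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))

# 97.5% point of the standard Normal distribution, about 1.96.
z_crit = NormalDist().inv_cdf(0.975)
ci_low, ci_high = b1 - z_crit * se, b1 + z_crit * se

print(round(z, 1), p_two_sided < 0.001)
print(round(ci_low, 3), round(ci_high, 3))
```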
28Comparing groups: regression with a single
categorical predictor
Suppose we fit the regression model
$y_i = \beta_0 + \beta_1 x_i + e_i$,
where $y_i$ is the hedonism score of individual i,
and $x_i = 1$ if the individual is female and 0 if
the individual is male. We then obtain an
estimated gender coefficient of
$\hat\beta_1 = -0.156$ with a standard error of
0.025.
The prediction for men is $\hat\beta_0$ and the
prediction for women is
$\hat\beta_0 + \hat\beta_1 = \hat\beta_0 - 0.156$.
The difference between men and women has a
z-ratio of -0.156/0.025 = -6.24, so we reject the
null hypothesis that the male and female means
are equal. The 95% CI for $\beta_1$ is
(-0.206, -0.106).
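With a 0/1 predictor, least squares reproduces the two group means: the intercept is the mean of the group coded 0 and the slope is the difference between the groups. The sketch below shows this in Python on made-up scores (not the ESS data), and then recomputes the interval from the slide's rounded estimate and standard error, which gives about (-0.205, -0.107).

```python
from statistics import mean

def ols_fit(x, y):
    """Closed-form least squares estimates (b0, b1) for one predictor."""
    mx, my = mean(x), mean(y)
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - b1 * mx, b1

# Made-up scores: female = 0 for men, 1 for women.
female = [0, 0, 0, 1, 1, 1]
score = [0.2, 0.0, 0.4, -0.1, 0.1, 0.0]
b0, b1 = ols_fit(female, score)

mean_men = mean([s for f, s in zip(female, score) if f == 0])
mean_women = mean([s for f, s in zip(female, score) if f == 1])
print(b0, mean_men)               # intercept equals the male mean
print(b1, mean_women - mean_men)  # slope equals the female minus male difference

# 95% interval from the slide's rounded estimate and standard error.
ci = (-0.156 - 1.96 * 0.025, -0.156 + 1.96 * 0.025)
print(ci)
```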
29Comparing groups with more than two categories
We used two different parameterisations to
estimate the two gender means. Generally, the
first parameterisation, where the intercept is
multiplied by a constant vector of 1s, is
preferred. This is because when we add multiple
predictors into the model, interpretation of the
coefficients is more straightforward.
For every extra category in a predictor variable
we need to include an extra indicator or dummy
variable in our model. With an n-category
variable we need to include n-1 indicator
variables in addition to the intercept term to
model the means of the n groups. For example, to
model the three country means (UK, France and
Germany) we can fit the model
$y_i = \beta_0 + \beta_1 \mathrm{Germany}_i + \beta_2 \mathrm{France}_i + e_i$
30Country difference in hedonism
For UK residents (Germany = 0, France = 0):
predicted hedonism
= -0.384 + (0.256 × 0) + (0.492 × 0) = -0.384
For German residents (Germany = 1, France = 0):
predicted hedonism
= -0.384 + (0.256 × 1) + (0.492 × 0) = -0.128
For French residents (Germany = 0, France = 1):
predicted hedonism
= -0.384 + (0.256 × 0) + (0.492 × 1) = 0.108
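The three predictions follow mechanically from the dummy coding. A short sketch in Python, using the rounded coefficients from the slide with the UK as the reference category; `dummies` and `predict` are hypothetical helper names.

```python
# Rounded coefficients from the slides: intercept (UK reference), Germany, France.
b0, b_germany, b_france = -0.384, 0.256, 0.492

def dummies(country):
    """Return (germany, france) indicators; the UK is the reference category."""
    return (int(country == "Germany"), int(country == "France"))

def predict(country):
    germany, france = dummies(country)
    return b0 + b_germany * germany + b_france * france

predictions = {c: round(predict(c), 3) for c in ("UK", "Germany", "France")}
print(predictions)
```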
31Hypothesis testing for categorical predictors
with more than two groups
What if we want to test the France/Germany
difference?
We could reparameterise the model so that
Germany, instead of the UK, was the reference
category.
Or we could conduct a Wald test on the equality
of the Germany and France coefficients
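A Wald test of the France/Germany difference needs the variances and covariance of the two coefficient estimates, which the slides do not report. The sketch below is illustrative Python: the coefficients are the rounded slide values, but the variance and covariance inputs are hypothetical placeholders, and `wald_difference` is a hypothetical helper rather than an MLwiN function.

```python
from statistics import NormalDist

def wald_difference(b_a, b_b, var_a, var_b, cov_ab):
    """Wald z statistic and two-sided p-value for H0: b_a == b_b."""
    diff = b_a - b_b
    se_diff = (var_a + var_b - 2 * cov_ab) ** 0.5
    z = diff / se_diff
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return diff, z, p

# Coefficients are the rounded slide values (France, Germany). The variances
# and covariance are HYPOTHETICAL placeholders, not results from the ESS data.
diff, z, p = wald_difference(0.492, 0.256, var_a=0.0009, var_b=0.0009, cov_ab=0.0004)
print(round(diff, 3), round(z, 2), p)
```

The square of this z statistic is the usual Wald chi-squared statistic on one degree of freedom. Making Germany the reference category would give the same difference, 0.492 - 0.256 = 0.236, directly as the France coefficient.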
32More than one predictor variable-statistical
control
When modelling the effects of the country
predictor variable we already entered multiple
dummy or indicator variables.
We can add multiple predictor variables into our
model, where categorical predictor variables will
be handled by a set of dummy variables and
continuous predictor variables will be handled by
including the variable directly
Once we include more than one predictor variable
our model can address the issue of statistical
control
Does the association of one predictor variable
with the response persist when we simultaneously
account for further predictor variables?
33Example of statistical control with the hedonism
data
We have already seen that
Women are less hedonistic than men
Hedonism decreases with age
However, women live longer than men. So some of
the gender gap will be due to the fact that women
are on average older than men.
Some, but how much?
We can answer this question by fitting age and
gender in the same model.
This will tell us if the gender gap persists
after controlling for age.
34Modelling gender and age simultaneously
The gender effect in the model where gender is
the only predictor variable is -0.156. So the
gender effect persists strongly after controlling
for age.
35Statistical control: another example
Imagine attainment scores in two schools
Fitting school as a single predictor:
School B has a negative effect
But controlling for prior ability:
School B has a positive effect
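A sign change of this kind is easy to reproduce with made-up numbers. The Python sketch below assumes a hypothetical school B that takes in pupils with lower prior ability but adds 0.5 to attainment. The adjusted effect is obtained by removing prior ability from both school and attainment and regressing one set of residuals on the other, which gives the same coefficient as fitting both predictors together.

```python
from statistics import mean

def slope(x, y):
    """Least squares slope from the simple regression of y on x."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)

def residuals(x, y):
    """Residuals from the simple regression of y on x."""
    b1 = slope(x, y)
    b0 = mean(y) - b1 * mean(x)
    return [b - (b0 + b1 * a) for a, b in zip(x, y)]

# Made-up pupils: school_b = 1 marks school B, which has lower prior ability.
prior = [2, 3, 4, 0, 1, 2]
school_b = [0, 0, 0, 1, 1, 1]
attain = [p + 0.5 * s for p, s in zip(prior, school_b)]

raw_effect = slope(school_b, attain)  # school as the single predictor
# Effect of school after removing prior ability from both variables.
adjusted_effect = slope(residuals(prior, school_b), residuals(prior, attain))
print(round(raw_effect, 3), round(adjusted_effect, 3))
```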
36Interactions between predictor variables
Recall our model with age and gender effects.
It may be that the gender gap changes as a
function of age or, equivalently, that the age
slope is not the same for men and women.
We can test for this by including an interaction
between age and female as an extra explanatory
variable in the model.
We do this by including a variable that is the
product of age and female.
37Gender x Age interaction effects
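Results
$\hat{y}_i = -0.058 - 0.153\,\mathrm{female}_i - 0.019\,\mathrm{age}_i + 0.002\,\mathrm{female}_i \times \mathrm{age}_i$
This gives a prediction line for males
(female = 0) of
$-0.058 - 0.153 \times 0 - 0.019\,\mathrm{age}_i + 0.002 \times 0 \times \mathrm{age}_i = -0.058 - 0.019\,\mathrm{age}_i$
and for females (female = 1) of
$-0.058 - 0.153 \times 1 - 0.019\,\mathrm{age}_i + 0.002 \times 1 \times \mathrm{age}_i = (-0.058 - 0.153) + (-0.019 + 0.002)\,\mathrm{age}_i$
That is, females have an intercept 0.153 lower
than males and a slope 0.002 greater than males.
Note that the gender difference in the slopes,
0.002, has a z-ratio of 2, so it is just
statistically significant at the 5% level.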
38Graphing the lines
male: $-0.058 - 0.019\,\mathrm{age}_i$
female: $(-0.058 - 0.153) + (-0.019 + 0.002)\,\mathrm{age}_i$
The slightly flatter (less negative) slope for
females means the gender gap decreases with age
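The two lines and the gap between them can be evaluated at chosen ages. The Python sketch below uses the rounded coefficients from the slides and assumes, as the final slide suggests, that age is centered at 46 years; `predict` and `gender_gap` are hypothetical helper names.

```python
# Rounded coefficients from the slides; age is assumed to be centered at 46.
b0, b_female, b_age, b_inter = -0.058, -0.153, -0.019, 0.002

def predict(age_centered, female):
    """Predicted hedonism, with female * age as the interaction term."""
    return (b0 + b_female * female + b_age * age_centered
            + b_inter * female * age_centered)

def gender_gap(age_centered):
    """Female minus male prediction at a given centered age."""
    return predict(age_centered, 1) - predict(age_centered, 0)

for years in (20, 46, 76):
    print(years, round(gender_gap(years - 46), 3))
```

The gap is -0.153 at the average age and moves towards zero by 0.002 for each additional year, which is the narrowing described above.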
39Examining the gender gap
We may want to know: does the gender gap remain
statistically significant even at higher ages,
when it is diminished?
The gender gap (females minus males) is
$-0.153 + 0.002\,\mathrm{age}_i$
So we can plot this function with its
associated confidence envelope and see for which
ages the confidence interval does not include
0 (no gender gap)
40Graphing the gender gap with 95% CI
The gender gap becomes statistically
insignificant at age - 46 = 30, that is, at 76
years