Battling entropy: the development of the MLwiN statistical modelling package: the confessions of a well-intentioned hacker. Jon Rasbash, Centre for Multilevel Modelling, University of Bristol

Battling entropy: the development of the MLwiN
statistical modelling package: the confessions of
a well-intentioned hacker. Jon Rasbash, Centre
for Multilevel Modelling, University of Bristol
The way it is.
Here is Edward Bear, coming downstairs now, bump,
bump, bump, on the back of his head.
It is, as far as he knows, the only way of coming
downstairs, but sometimes he feels that there
really is another way,
if he could stop bumping for a moment and think
of it.
And then he feels that perhaps there isn't.
Another relevant opening paragraph
A doctor, a civil engineer and a computer
scientist were arguing about what was the oldest
profession in the world. The doctor said, "Well, in
the Bible it says that God created Eve from a rib
taken from Adam. Clearly this required surgery, so
my profession must be the oldest in the world."
The civil engineer interrupted, "But earlier in
the book of Genesis it says that God created the
order of the heavens and the earth out of chaos.
That was certainly a most spectacular feat of
civil engineering. So, Doctor, my profession is
the oldest."
The computer scientist smiled confidently: "Who
do you think created the chaos?"
(Grady Booch, Object-Oriented Analysis and
Design with Applications)
Origins of MLwiN
Mike Healy's NANOSTAT (1981), a Minitab clone
written in RATFOR.
B. W. Kernighan, RATFOR -- A Rational Fortran,
Workshop on Fortran Preprocessors, Pasadena,
Calif., pp. 3, November 1974.
Mike wanted to do something on his Osborne
Portable Computer, so he wrote NANOSTAT.
NANOSTAT architecture
Like MINITAB, data are represented as a set of columns.
Command verbs take columns, numbers and boxes
as arguments.
Commands can be strung together, with outputs from
one command acting as inputs to another.
A simple architecture involving a command parser,
functions to create columns and a series of a
hundred or so commands that take inputs and create
outputs with no side-effects.
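
The architecture described above (a parser, named columns, and command verbs that map inputs to outputs without side-effects) can be sketched roughly as follows. This is an illustration only, not NANOSTAT's code: the verbs SET, ADD, MULT and MEAN, the column names and the argument order are all assumptions made for the example.

```python
# Hypothetical sketch of a column-worksheet command interpreter.
# Verb names and syntax are invented for illustration.
import operator
from typing import Callable, Dict, List, Tuple

Column = List[float]
Worksheet = Dict[str, Column]


def elementwise(op: Callable[[float, float], float]) -> Callable[[Column, Column], Column]:
    """Build a two-input command; a one-element input is recycled as a scalar."""
    def command(a: Column, b: Column) -> Column:
        n = max(len(a), len(b))
        a = a * n if len(a) == 1 else a
        b = b * n if len(b) == 1 else b
        return [op(x, y) for x, y in zip(a, b)]
    return command


def mean(a: Column) -> Column:
    """Return the mean as a one-element column."""
    return [sum(a) / len(a)]


# verb -> (pure function, number of inputs); the last argument names the output
COMMANDS: Dict[str, Tuple[Callable[..., Column], int]] = {
    "ADD": (elementwise(operator.add), 2),
    "MULT": (elementwise(operator.mul), 2),
    "MEAN": (mean, 1),
}


def fetch(ws: Worksheet, token: str) -> Column:
    """Resolve an argument: a copy of a named column, or a number as a 1-element column."""
    return list(ws[token]) if token in ws else [float(token)]


def run(line: str, ws: Worksheet) -> None:
    """Parse one command line and store its output column in the worksheet."""
    verb, *args = line.split()
    if verb == "SET":                      # SET c1 1 2 3 -> create a column
        ws[args[0]] = [float(t) for t in args[1:]]
        return
    func, n_in = COMMANDS[verb]
    inputs = [fetch(ws, t) for t in args[:n_in]]
    ws[args[n_in]] = func(*inputs)         # only the named output column changes


ws: Worksheet = {}
for line in ["SET c1 1 2 3", "MULT c1 2 c2", "ADD c1 c2 c3", "MEAN c3 c4"]:
    run(line, ws)
print(ws["c3"], ws["c4"])                  # [3.0, 6.0, 9.0] [6.0]
```

Because each command only reads its inputs and writes one named output, the output of one command can be fed straight into the next, which is the chaining property the slide describes.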
ML2,3,N DOS programs
We added capabilities to fit a two-level
multilevel model in 1988 and called the program
ML2.
ML3 was released in 1990 and the source code was
translated to C.
MLN was released in 1995 and the new N-level
algorithm was written in C++.
MLN and C++
The N-level computational algorithm (never
published) is a set of C++ classes for handling
problem-specific, highly patterned matrices. To
illustrate, consider the model
One computationally intensive step in the IGLS
algorithm is to estimate the variances and
covariances of the random effects. Let's look at
what that involves from a computing perspective.
Estimating equation for θ
But given block diagonality of V^-1 this
simplifies to
which greatly reduces computational load.
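
The equations on this slide did not survive in the transcript. As a hedged illustration of the kind of simplification meant, assume the estimating equation has the usual generalised least squares form theta = (Z' V^-1 Z)^-1 Z' V^-1 y. When V is block diagonal with blocks V_j, both cross-products become sums over blocks, so only the small V_j ever need to be inverted. The sketch below checks that the two routes agree on invented data; the function names are hypothetical and exact fractions are used only so the results can be compared with ==.

```python
# Sketch: GLS with a block-diagonal V, computed on the full matrices and block by block.
from fractions import Fraction


def frac(m):
    return [[Fraction(v) for v in row] for row in m]


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    return [[x + y for x, y in zip(r, s)] for r, s in zip(a, b)]


def solve(a, b):
    """Return x with a x = b, by Gauss-Jordan elimination on exact fractions."""
    n = len(a)
    m = [list(ra) + list(rb) for ra, rb in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return [row[n:] for row in m]


def block_diag(mats):
    """Assemble the big block-diagonal matrix (needed only for the comparison)."""
    size = sum(len(m) for m in mats)
    out = [[Fraction(0)] * size for _ in range(size)]
    start = 0
    for m in mats:
        for i, row in enumerate(m):
            out[start + i][start:start + len(row)] = row
        start += len(m)
    return out


def gls_full(Z, V, y):
    """theta = (Z' V^-1 Z)^-1 Z' V^-1 y using the full matrices."""
    Zt = transpose(Z)
    return solve(matmul(Zt, solve(V, Z)), matmul(Zt, solve(V, y)))


def gls_blockwise(blocks):
    """Same estimate, accumulating Z_j' V_j^-1 Z_j and Z_j' V_j^-1 y_j over blocks."""
    A = b = None
    for Zj, Vj, yj in blocks:
        Zt = transpose(Zj)
        Aj, bj = matmul(Zt, solve(Vj, Zj)), matmul(Zt, solve(Vj, yj))
        A = Aj if A is None else add(A, Aj)
        b = bj if b is None else add(b, bj)
    return solve(A, b)


# Invented data: one block of three units and one block of two units.
blocks = [
    (frac([[1, 1], [1, 0], [1, 0]]), frac([[2, 1, 1], [1, 2, 1], [1, 1, 2]]), frac([[1], [2], [3]])),
    (frac([[1, 1], [1, 0]]), frac([[2, 1], [1, 2]]), frac([[4], [6]])),
]
Z = [row for Zj, _, _ in blocks for row in Zj]
y = [row for _, _, yj in blocks for row in yj]
V = block_diag([Vj for _, Vj, _ in blocks])

theta_full = gls_full(Z, V, y)
theta_blockwise = gls_blockwise(blocks)
assert theta_full == theta_blockwise
```

The blockwise route never forms or inverts the full V, which is where the saving in storage and arithmetic comes from.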
Storage is proportional to n_j^4 and flop counts
are proportional to n_j^3.
If n_j = 100, RAM requirements > 100MB. In the early
1990s this was not possible on PCs so
Exploiting patterns
All the large matrices were highly structured and
could be represented in terms of complex
expressions using smaller building block
matrices. Doing this reduces computation:
storage from n_j^4 to n_j p and flop counts
from n_j^3 to n_j p^2.
Creating the C++ matrix class hierarchy
In designing this class hierarchy I wanted to be
able to take expressions such as
and program them directly as
theta = inv(Zstar * inv(Vstar) * Zstar) * (Zstar * inv(Vstar)) * ystar
However, we are working here in terms of the big
matrices, which directly reflect statistical logic
but are hopelessly inefficient computationally.
Each big matrix is represented internally as a
patterned set of smaller rectangular and symmetric
matrices. The statistical logic can then be
expressed at an abstract level but the details of
storage and computation handled efficiently by
the class hierarchy.
Code was fast and efficient and has been pumping
away for over a decade.
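
The MLN class hierarchy was never published, so the following is only a sketch of the general idea, under stated assumptions: a "big" matrix is stored as a pattern of small pieces, yet the calling code still reads like the statistical expression. The example assumes a random-intercept structure in which each block is a*I + b*J (J a matrix of ones), so a block needs two numbers rather than n^2, and its inverse has the same pattern. The class names CompoundSymmetric and BlockDiagonal are hypothetical.

```python
# Sketch: patterned matrices that support inverse and multiplication without dense storage.
from fractions import Fraction


class CompoundSymmetric:
    """n x n matrix a*I + b*J (J all ones), stored as just two numbers."""

    def __init__(self, n, a, b):
        self.n, self.a, self.b = n, Fraction(a), Fraction(b)

    def inverse(self):
        # Closed form (Sherman-Morrison): (aI + bJ)^-1 = (1/a) I - b / (a (a + n b)) J
        a, b, n = self.a, self.b, self.n
        return CompoundSymmetric(n, 1 / a, -b / (a * (a + n * b)))

    def __matmul__(self, x):
        # (aI + bJ) x = a x + b * sum(x) * 1, so no n x n array is needed
        total = sum(x)
        return [self.a * xi + self.b * total for xi in x]


class BlockDiagonal:
    """Block-diagonal matrix held as a list of patterned blocks."""

    def __init__(self, blocks):
        self.blocks = list(blocks)

    def inverse(self):
        return BlockDiagonal(blk.inverse() for blk in self.blocks)

    def __matmul__(self, x):
        out, start = [], 0
        for blk in self.blocks:
            out.extend(blk @ x[start:start + blk.n])
            start += blk.n
        return out


# Two level-2 units with 3 and 2 level-1 units; variances are invented.
sigma2_e, sigma2_u = Fraction(1), Fraction(1, 2)
V = BlockDiagonal([
    CompoundSymmetric(3, sigma2_e, sigma2_u),
    CompoundSymmetric(2, sigma2_e, sigma2_u),
])
y = [Fraction(v) for v in (1, 2, 3, 4, 6)]

weighted = V.inverse() @ y      # reads like inv(V) * y, computed on the small pieces
assert V @ weighted == y        # multiplying back recovers y exactly
```

The point of the design is the last two lines: the expression is written in terms of the big matrix V, while storage and arithmetic stay proportional to the size of the blocks.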
But did C++ and OOD help? Not sure.
C++ syntax, compiler error messages and garbage
collection were difficult.
For example, I would get some complex message about
why a variable could not be seen, when I thought I
had followed C++/OOD principles and syntax.
Then I think, "Oh sod it, I'll just make the
variable global", which breaks the encapsulation.
I have not touched the code for at least 5 years
and have no intention of extending it.
I ignored the advice: don't do a new application
and learn C++/OOD at the same time.
How well does OOD, which conceptualises problems
around a series of communicating objects with
taxonomic relationships specified by
class hierarchies, work for the highly procedural
business of statistical algorithm development?
It would have helped to have a mentor with good
applied experience of OOP/OOD.
Is there a macho (or perhaps lawyer-like) culture
lurking in software engineering?
COM example??
My early experience contacting computer scientists
In 1996 we began work on a Windows version of
MLN.
Key difference between console-based MLN and
Windows-based MLwiN
In MLN you only see something (e.g. model setup,
graph, prediction, data, multilevel residuals,
model constraints, hypothesis tests etc.) when you
ask for it with a command.
In MLwiN all these interdependent objects can be
displayed simultaneously on screen in different
windows, and an action changing one can have
effects on the objects viewed in all the other
windows, which must then be redrawn.
We therefore require an architecture that passes
messages to windows when their displays have
become out of date; the windows can then respond
by redrawing themselves as they see fit.
Objects responding to messages: the OOD paradigm.
MLwiN implementation
GUI front end written in VB. Turn the command-driven
console app from an EXE into a DLL.
Simultaneously we had an application in to JISC
for a parallel and distributed processing version
of MLN/MLwiN, where the GUI runs on a PC and
computation is done on a server or a grid.
This required minimising data transfer from the GUI
to the DLL handling the computation.
Recording system state and task processing are
handled by the C DLL. The VB front end is a
view on the system (collecting input and
displaying output).
MLwiN architecture to handle simultaneous
interdependent displays and buffering of GUI/back
end data.
[Architecture diagram. Components: windows; action
manager (dispatcher); data buffers with invalid
flags (one per data item); C command-driven program.
Per action: which data structures the action sets
out of date. Per window: which actions affect it.
Labelled flows: register interest in actions;
request action; send commands; notify windows of
action; request data; data invalid; copy data.]
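
The message flow named in the diagram labels can be sketched as below. This is a minimal illustration, not MLwiN's code: the class names, the action names and the command string are hypothetical, and the Backend class merely stands in for the command-driven DLL. Windows register interest in actions; an action marks the data items it puts out of date; a window that redraws asks for data, and a buffer is refilled from the back end only if its invalid flag is set.

```python
# Sketch of an action manager (dispatcher) with per-item invalid flags.
from typing import Dict, List


class Backend:
    """Stand-in for the command-driven computational DLL (hypothetical)."""

    def __init__(self) -> None:
        self.commands: List[str] = []
        self.fetches = 0

    def run(self, command: str) -> None:
        self.commands.append(command)

    def fetch(self, item: str) -> str:
        self.fetches += 1                      # counts transfers across the GUI/back-end boundary
        return f"{item} after {len(self.commands)} command(s)"


class ActionManager:
    def __init__(self, backend: Backend) -> None:
        self.backend = backend
        self.outdates: Dict[str, List[str]] = {}        # action -> data items set out of date
        self.interested: Dict[str, List["Window"]] = {}  # action -> windows to notify
        self.buffers: Dict[str, str] = {}               # data item -> buffered copy
        self.invalid: Dict[str, bool] = {}              # data item -> invalid flag

    def declare_action(self, action: str, items: List[str]) -> None:
        self.outdates[action] = list(items)

    def register_interest(self, window: "Window", action: str) -> None:
        self.interested.setdefault(action, []).append(window)

    def request_action(self, action: str, command: str) -> None:
        self.backend.run(command)                       # send commands
        for item in self.outdates[action]:
            self.invalid[item] = True                   # data invalid
        for window in self.interested.get(action, []):
            window.notify(action)                       # notify windows of action

    def request_data(self, item: str) -> str:
        if self.invalid.get(item, True):                # never fetched counts as invalid
            self.buffers[item] = self.backend.fetch(item)   # copy data
            self.invalid[item] = False
        return self.buffers[item]


class Window:
    def __init__(self, name: str, item: str, manager: ActionManager, actions: List[str]) -> None:
        self.name, self.item, self.manager = name, item, manager
        self.redraws = 0
        self.shown = None
        for action in actions:
            manager.register_interest(self, action)

    def notify(self, action: str) -> None:
        self.shown = self.manager.request_data(self.item)   # redraw from buffered data
        self.redraws += 1


backend = Backend()
manager = ActionManager(backend)
manager.declare_action("change_model", ["estimates", "residuals"])
manager.declare_action("edit_data", ["data", "estimates", "residuals"])
equations = Window("Equations", "estimates", manager, ["change_model", "edit_data"])
residuals = Window("Residuals", "residuals", manager, ["change_model", "edit_data"])
data_view = Window("Data", "data", manager, ["edit_data"])

manager.request_action("change_model", "hypothetical command: add random slope")
manager.request_data("estimates")      # served from the buffer, no second fetch
```

Here the data window is not redrawn by a model change, and each out-of-date item crosses the GUI/back-end boundary once, which is the data-transfer saving the design was after.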
Done with some help
The above architectural framework has worked well.
A friend, Bruce Cameron, was hired as a project
consultant to design the framework. We benefited
greatly from the input of an experienced software
engineer/systems analyst.
Bruce's input was probably crucial to MLwiN's
success, such as it is.
MLwiN 1.0 was released in 1998.
The Equations window
One of the design features was to allow users to
work with statistical equations directly to
specify and explore multilevel models
This is because expository materials were all
based around equation representations, and users
learning MM had a double whammy: understanding
how the equations operationalised the techniques,
then translating from that representation to
commands for running the model, and then
back-translating text-based tables of results to
the equation representation. This translation
placed an unnecessary cognitive load on learners.
Many quantitative social scientists were
resistant to equations. But the influential
quantitative social scientists loved it.
Equations window
An IO device that allows, via direct
manipulation, models to be specified and changed
and results to be viewed. An IO device embedded
in the statistical context. Not an open-ended
declarative symbolic language processor.
ML regression model with random intercepts
already specified by pointing and clicking. To
extend to random slopes.
Programming the Equations window
The Equations window was a great success but
extremely straightforward to implement. This was
because we had the right frameworks:
  • VB's GUI programming model
  • Bruce's synchronisation architecture.

The project was joined in 1998 by Bill Browne, who
implemented MCMC algorithms for multilevel models
in MLwiN.
Bill implemented special-case, optimised code.
It became apparent that MCMC algorithms were
easier to extend to a wide range of statistical
models than the IGLS and other algorithms we had
been working with. Also these algorithms scaled
well in terms of computational load.
Bill worked with the Centre for Multilevel Modelling
from 1998 to 2003; much of his work on the program
is recorded in Browne, W.J. (2003). MCMC Estimation
in MLwiN (Version 2.0). Institute of Education,
University of London.
Extensibility problems
By 1999, although the architecture for the move
to Windows was reasonably sound, another
architectural problem was coming into focus.
The software architecture reflecting the
representation of statistical models was ten
years out of date, with new developments being
shoe-horned into the old architecture. A few
key differences over the decade:
Ten years earlier: Normal responses; hierarchical
population structures; IGLS estimation.
By 1999: Normal, Poisson, Binomial and Multinomial
responses; hierarchical, crossed and multiple
membership structures; IGLS, bootstrap and MCMC
estimation.
Time for a major redesign of the software
Update the architecture to reflect the new types
of models that we had developed.
Make the new model information structures
independent of estimation method, e.g. convenient
to plug in IGLS, MCMC, quadrature, SIM_ML,
bootstrapping, AIP.
Current model structures are IGLS-centric.
A central strand of statistical analysis is the
process of working through a series of models and
comparing them. Update the software architecture
to support multiple live statistical models.
Create an object model of the objects that are
the stuff of statistical modelling: data, models,
estimates, predictions, graphs, estimation
engines etc.
Design in interoperability with other software
(via COM, CORBA).
A big task: could UML help?
After reading quite a bit of Grady Booch and
other 3-Amigo texts I got excited about using UML
and OOD to help us implement the next generation
of the MLwiN software.
I thought: this is a great opportunity to learn OO
design and process skills and bring some much
needed rigour, clarity and good practice to our
software design and development procedures.
I set to work.
A year later I crumpled into a heap and simply
could not continue.
What went wrong?
UML helping communication
A key feature claimed for the UML diagrams is
that they serve as a representation that software
developers and application experts (statisticians)
can use to communicate reasonably unambiguously.
This helps ensure that the developers build the
system the application experts want and that the
objects in the system (and their
inter-relationships) correspond to objects in the
application knowledge domain facilitating
When I tried to use UML diagrams to talk about
statistical structures and processes to
statisticians I found they got in the way. This
could be due to my inexpert use of the diagrams.
They got frustrated and I got defensive.
Lost in the process
I got lost in the UML multi-phase, iterative
process. Had I spent enough time developing
use-cases? Should I now move on to static class
diagrams? How detailed should they be at this
stage? Have I got the fundamental class design
right? Would these interaction diagrams be useful
now? And what exactly was this Rational Unified
Process anyway?
First of all, I thought that if I read enough, I
would be able to get things clear. Which seemed to
work. Until I tried to apply what I had read.
Then I thought, well, I'll just plough on anyway
and it will become clear through doing. Oh, I am
still confused; better go back and read some more.
After a year of this I had failed to produce a
single line of code.
Not another bloody ticket sales application.
All the UML texts used airline ticket sales or
loyalty card schemes as their exemplars, sometimes
with hundreds of pages for a single worked
example. But I found it hard to transpose those
exemplars onto using UML to design/implement a
statistical modelling system.
A victim of hype
Although the UML texts contain statements like
"there is no silver bullet", they are very
persuasive: they are selling a methodology and, in
the case of Rational Rose, software products to
go with the methodology.
Some stronger health warnings on the packet might
have been helpful and also some case studies of
where and why UML failed.
Mentor required
In hindsight I realised that I needed a mentor to
guide me through the process.
Mea culpa: I could have sought out a mentor, but
I had the feeling that I had really better clarify
things a bit before seeking help from an expert. A
possibly fatal lack of confidence on my part.
Friendly, accessible experts required.
Current development strategy for new statistical
models
We are currently developing MCMC estimation
models for:
Multilevel latent category models (aka growth ...)
Multilevel mover/stayer models
Multilevel factor analysis and structural
equation models
Multilevel multivariate response models with
responses of different types defined at different
levels. These models are useful for simultaneous
equation models, multiprocess models and causal
models, and as an engine for multiple imputation
for missing data.
All these models are being developed in MATLAB.
MATLAB as a prototyping environment
Relevant MATLAB features
Excellent features for matrix programming, thus
good for prototyping algorithms.
A GUI RAD programming framework (combos, sliders,
buttons, radio boxes, check boxes, text boxes,
menus, list boxes, button groups, panels) with all
the obvious event hooks defined. If that is not
enough, a container for any ActiveX control.
Renders TeX strings into equations.
Excellent external interface to other systems:
DLL (with extensive examples for C and FORTRAN),
The MATLAB compiler will translate a set of .m
files to C or C++, compile and link them. This
allows easy creation of a royalty-free EXE or a
DLL.
Development process
For each new model to be implemented
C DLL interfaced to MLwiN
Develop algorithms, model set-up
interface, model output display and model
diagnostic devices in MATLAB
Pass results back to MLwiN structures
MLwiN model appears on MLwiN menu.
Are we using MATLAB as a development engine or a
prototyping environment?
Code is not as fast as handcrafted C/C++, by
about an order of magnitude.
Architecture is a little piecemeal, treating each
new model type as a separate entity. Lacks
extensibility. What happens if you want to
combine model types?
However, two project programmers and two
statisticians have a very immediate need to learn
MCMC and this provides a good platform for that.
As team members develop a better understanding of
MCMC we can then think about a more general,
extensible architecture.
MCMC learning group
We are seeking funding to set up an MCMC learning
group of about 10 people associated with the team:
mathematical statisticians, programmers/software
engineers and applied social statisticians.
The group will use an online learning environment
and work through simple to more complex models
using MCMC estimation, covering:
MCMC estimation theory for each model
Implementation in MATLAB, handcrafted C/C++
routines, BUGS and OpenBUGS
Applications of the model to substantive problems.
Outputs of the learning group
A better understanding across the team of the
potential of MCMC estimation.
Better understanding of computing issues for the
specification, estimation and interpretation of
statistical models using MCMC.
This increased understanding will guide decisions
on a future, more general architecture. One option
could be for MLwiN to become a front end for
OpenBUGS, which provides general model
specification structures and access to samplers
for nodes.
Leaving a learning ladder for others to follow.
Recently received funding to do some statistical
methodology development but mostly capacity
building for social scientists.
From our workshops we know that many quantitative
social science researchers in government and
academic departments don't understand the
mechanics of a multiple regression equation with
interactions between continuous and categorical
variables.
We are thinking hard about who we can target for
progression, where conceptually and socially(e.g.
work environment) they are getting stuck and what
software tools, training materials and formats
they need.
The architecture of the learning environment we
are developing could be the subject of another
whole presentation.
A cross-disciplinary model for development?
Software and training materials
Standards for statistical model representation
Many tools exist for transferring primary data
between proprietary formats and existing
standards. However no standards exist for the
secondary data of statistical model structure and
no tools exist for transferring between
proprietary standards for representing model
structure. (Some exceptions in data mining).
Development of a cross-platform,
language-independent component for storing model
specification is highly desirable.
Standard Model component
Generic statistical model representation
Model-EE (estimation engine) interface
Usual advantages of component based design
New EE algorithms can be plugged into the model
making comparison of EE much easier good
Different data sources (e.g. Excel or SAS
worksheets) can easily be bound to the model.
Alternative GUI devices can be plugged into the
model for developing model specification and
exploration tools.
Facilitates collaborative working.
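
A rough sketch of what such a component might look like follows. Everything in it is an assumption made for illustration: the ModelSpec fields, the JSON layout, the engine registry and the two toy engines, which only estimate an intercept by the sample mean and ignore the random part. It is meant to show the shape of the interface (a model specification that can be stored in a language-neutral form, with estimation engines plugged in against it), not a proposed standard.

```python
# Hypothetical sketch: a storable model specification with pluggable estimation engines.
import json
from dataclasses import asdict, dataclass
from typing import Callable, Dict, List


@dataclass
class ModelSpec:
    response: str
    distribution: str
    fixed: List[str]                 # terms with fixed coefficients
    random: Dict[str, List[str]]     # level name -> terms with random coefficients

    def to_json(self) -> str:
        """Serialise to a language-neutral text form."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "ModelSpec":
        return cls(**json.loads(text))


Data = Dict[str, List[float]]
Engine = Callable[[ModelSpec, Data], Dict[str, float]]
ENGINES: Dict[str, Engine] = {}


def register(name: str) -> Callable[[Engine], Engine]:
    """Decorator that plugs an estimation engine into the registry."""
    def decorator(func: Engine) -> Engine:
        ENGINES[name] = func
        return func
    return decorator


@register("closed_form")
def closed_form(spec: ModelSpec, data: Data) -> Dict[str, float]:
    # Toy engine: intercept estimated by the sample mean.
    y = data[spec.response]
    return {"cons": sum(y) / len(y)}


@register("iterative")
def iterative(spec: ModelSpec, data: Data) -> Dict[str, float]:
    # Toy engine: same estimate reached by a running update.
    est = 0.0
    for k, value in enumerate(data[spec.response], start=1):
        est += (value - est) / k
    return {"cons": est}


spec = ModelSpec("y", "normal", fixed=["cons"], random={"pupil": ["cons"]})
restored = ModelSpec.from_json(spec.to_json())      # round trip through the stored form
data = {"y": [1.0, 2.0, 3.0, 6.0]}
results = {name: fit(restored, data) for name, fit in ENGINES.items()}
```

Because every engine receives the same specification and data, comparing engines reduces to comparing the entries of results.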
Is graphical modelling a good framework to use to
build the model representation component?
It's papers, not programs, stupid.
Software engineering is not credited. ESRC for
many years explicitly did not fund it. They had a
policy of prototyping only and leaving commercial
outfits to exploit and further develop into
widely usable systems. Misguided: interaction
between software engineers, statisticians and
applied researchers is crucial. Commercial outfits
take too long to respond. We sneaked in under
the radar.
Now changing with the rising profile of GRID and
Academic environment produces organic rather than
structured development.
Software engineering can be very valuable but
software modelling techniques can be complex and
easy to get lost in. Again good
cross-disciplinary communication required.