Title: Research Centre for Vocational Education
1 Research Centre for Vocational Education
Bayesian Modeling in Educational Research
Petri Nokelainen petri.nokelainen_at_uta.fi
v2.2 This lecture is available at
http://www.uta.fi/aktkk/lectures/bayes_en
2 Outline
- Research Overview
- Introduction to Bayesian Modeling
- Investigating Non-linearities with Bayesian Networks
- Bayesian Classification Modeling
- Bayesian Dependency Modeling
- Bayesian Unsupervised Model-based Visualization
3 Research Overview
VOCATIONAL EDUCATION
COMPUTER SCIENCE
Professional growth and learning
Applied research
Pekka Ruohotie Petri Nokelainen
Online pedagogy
Authentic use of educational technology
4 Research Overview
RCVE Data Server - On-line Questionnaire
BayMiner - Bayesian Supervised and Unsupervised
Visualization
2008
1997
5 Research Overview
- 1. BAYESIAN MODELING -> Linear vs. non-linear dependencies -> Dependency and classification modeling (B-COURSE) -> Unsupervised model-based visualisation (BAYMINER)
6 Linear and Non-linear Dependencies
7 Bayesian Classification Modeling
The classification accuracy of the best model found is 83.48 % (58.57 %).
COMMON FACTORS: PUB_T CC_PR CC_HE PA C_SHO C_FAIL CC_AB CC_ES
8 Bayesian Dependency Modeling
9 Bayesian Unsupervised Model-based Visualization
10 Current Research
- 1. Bayesian track
- Normalized Maximum Likelihood (NML) test
- Bayesian dependency modeling with latent variables
- Visualization of Bayesian networks (Banex)
- 2. Educational technology track
- Proactive search to help learners to manage information online
- Empirical validation of the technical and pedagogical usability criteria of digital learning material
11 Outline
- Research Overview
- Introduction to Bayesian Modeling
- Investigating Non-linearities with Bayesian Networks
- Bayesian Classification Modeling
- Bayesian Dependency Modeling
- Bayesian Unsupervised Model-based Visualization
12 Putting Bayesian Techniques on the Map
13 Introduction to Bayesian Modeling
- Probability is a mathematical construct that behaves in accordance with certain rules and can be used to represent uncertainty.
- Bayesian inference uses conditional probabilities to represent uncertainty.
- P(M | D, I) is the probability of the unknown things (M) given the data (D) and background information (I).
14 Introduction to Bayesian Modeling
- The essence of Bayesian inference is in the rule, known as Bayes' theorem, that tells us how to update our initial probabilities P(M | I) if we see data D, in order to find out P(M | D, I).
- A priori probability
- Conditional probability
- Posteriori probability
P(M | D) = P(D | M) P(M) / Σ_M P(D | M) P(M)
15 Introduction to Bayesian Modeling
- Bayesian inference comprises the following three principal steps:
- (1) Obtain the initial probabilities P(M | I) for the unknown things. (Prior distribution.)
- (2) Calculate the probabilities of the data D given different values for the unknown things, i.e., P(D | M, I). (Likelihood.)
- (3) Finally, the probability distribution of interest, P(M | D, I), is calculated using Bayes' theorem. (Posterior distribution.)
- Bayes' theorem can be used sequentially.
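The three steps, and the sequential use of Bayes' theorem, can be sketched in a few lines of Python. This is a minimal illustration added to the lecture notes; the hypothesis names and likelihood values are invented, not taken from the lecture data:

```python
# A minimal sketch of sequential Bayesian updating over discrete hypotheses.

def update(priors, likelihoods):
    """One application of Bayes' theorem: posterior is proportional to likelihood x prior."""
    unnorm = {h: likelihoods[h] * p for h, p in priors.items()}
    z = sum(unnorm.values())
    return {h: u / z for h, u in unnorm.items()}

# Two hypothetical models with equal priors; observe two data points in sequence.
beliefs = {"M1": 0.5, "M2": 0.5}
for lik in [{"M1": 0.8, "M2": 0.3},   # P(D1 | M)
            {"M1": 0.6, "M2": 0.4}]:  # P(D2 | M)
    beliefs = update(beliefs, lik)    # the posterior becomes the next prior
```

Sequential use simply means that yesterday's posterior is today's prior, as the loop above shows.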
16 Introduction to Bayesian Modeling
- The Bayesian method
- (1) is parameter-free: user input is not required; instead, prior distributions of the model offer a theoretically justifiable method for affecting the model construction,
- (2) works with probabilities and can hence be expected to produce robust results with discrete data containing nominal and ordinal attributes,
- (3) has no limit for minimum sample size,
- (4) is able to analyze both linear and non-linear dependencies,
- (5) assumes no multivariate normal model.
17 C_Example 1: Applying Bayes' Theorem
- Company A is employing workers on short-term jobs that are well paid.
- The job sets certain prerequisites to applicants: linguistic abilities and looks.
- Earlier all the applicants were interviewed, but nowadays it has become an impossible task as both the number of open vacancies and applicants has increased enormously.
- The personnel department of the company was ordered to develop a questionnaire to preselect the most suitable applicants for the interview.
18 C_Example 1: Applying Bayes' Theorem
- The psychometrician who developed the instrument estimates that it would work out right on 90 out of 100 applicants, if they are honest.
- We know on the basis of earlier interviews that the terms (linguistic abilities, looks) are valid for one per 100 persons living in the target population.
- The question is: if an applicant gets enough points to participate in the interview, is he or she hired for the job (after an interview)?
19 C_Example 1: Applying Bayes' Theorem
- The a priori probability P(H) is described by the number of those people in the target population that really are able to meet the requirements of the task (1 out of 100 = .01).
- The counter assumption of the a priori is P(¬H), which equals 1 − P(H), thus it is .99.
- The psychometrician's belief about how the instrument works is called the conditional probability: P(E | H) = .9.
- The instrument's failure to indicate non-valid applicants, i.e., those that are not able to succeed in the following interview, is stated as P(E | ¬H), which equals .1.
- These values need not sum to one!
20 C_Example 1: Applying Bayes' Theorem
- A priori probability
- Conditional probability
- Posterior probability
P(H | E) = P(E | H) P(H) / [P(E | H) P(H) + P(E | ¬H) P(¬H)]
         = (.9)(.01) / [(.9)(.01) + (.1)(.99)]
         ≈ .08
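As a sanity check, the arithmetic of this example can be reproduced with a few lines of Python (my sketch, not part of the original lecture material):

```python
def posterior(p_h, p_e_given_h, p_e_given_not_h):
    """Bayes' theorem for a binary hypothesis H and evidence E."""
    evidence = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)
    return p_e_given_h * p_h / evidence

# P(H) = .01, P(E|H) = .9, P(E|not H) = .1, as in the example.
p = posterior(0.01, 0.9, 0.1)
```

Only about 8 % of the applicants who pass the screening would actually succeed in the interview, which is the point of the example: a rare condition overwhelms a fairly accurate test.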
21 C_Example 1: Applying Bayes' Theorem
22 C_Example 1: Applying Bayes' Theorem
- What if the measurement error of the psychometrician's instrument would have been 20 per cent?
- P(E | H) = 0.8, P(E | ¬H) = 0.2
23 C_Example 1: Applying Bayes' Theorem
24 C_Example 1: Applying Bayes' Theorem
- What if the measurement error of the psychometrician's instrument would have been only one per cent?
- P(E | H) = 0.99, P(E | ¬H) = 0.01
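Both what-if scenarios follow from the same formula; the short sketch below (my addition, not from the lecture) shows how sharply the posterior reacts to the instrument's error rate:

```python
def posterior(p_h, p_e_given_h, p_e_given_not_h):
    """Bayes' theorem for a binary hypothesis H and evidence E."""
    evidence = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)
    return p_e_given_h * p_h / evidence

# Error rate 20 %: P(E|H) = .8, P(E|not H) = .2
p_20 = posterior(0.01, 0.8, 0.2)
# Error rate 1 %: P(E|H) = .99, P(E|not H) = .01
p_01 = posterior(0.01, 0.99, 0.01)
```

With a 20 per cent error rate the posterior drops to about .04; only at a 1 per cent error rate does it reach .50.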
25 C_Example 1: Applying Bayes' Theorem
26 C_Example 1: Applying Bayes' Theorem
- Quite often people tend to estimate the probabilities to be too high or too low, as they are not able to update their beliefs even in simple decision-making tasks when situations change dynamically (Anderson, 1995, p. 326).
27 C_Example 2: Comparison of Traditional Frequentistic and Bayesian Approach
- One of the most important rules educational science journals apply to judge the scientific merits of any submitted manuscript is that all the reported results should be based on the so-called null hypothesis significance testing procedure (NHSTP) and its featured product, the p-value.
- Gigerenzer, Krauss and Vitouch (2004, p. 392) describe the null ritual as follows:
- 1) Set up a statistical null hypothesis of no mean difference or zero correlation. Don't specify the predictions of your research or of any alternative substantive hypotheses;
- 2) Use 5 per cent as a convention for rejecting the null. If significant, accept your research hypothesis;
- 3) Always perform this procedure.
28 C_Example 2: Comparison of Traditional Frequentistic and Bayesian Approach
- A p-value is the probability of the observed data (or of more extreme data points), given that the null hypothesis H0 is true: P(D | H0) (id.).
- The first common misunderstanding is that the p-value of, say, a t-test would describe how probable it is to get the same result if the study is repeated many times (Thompson, 1994).
- Gerd Gigerenzer and his colleagues (id., p. 393) call this the replication fallacy, as P(D | H0) is confused with 1 − P(D).
29 C_Example 2: Comparison of Traditional Frequentistic and Bayesian Approach
- The second misunderstanding, shared by both applied statistics teachers and students, is that the p-value would prove or disprove H0. However, a significance test can only provide probabilities, not prove or disprove the null hypothesis.
- Gigerenzer (id., p. 393) calls this fallacy an illusion of certainty: despite wishful thinking, P(D | H0) is not the same as P(H0 | D), and a significance test does not and cannot provide a probability for a hypothesis.
- Bayesian statistics provides a way of calculating the probability of a hypothesis (discussed later in this section).
30 C_Example 2: Comparison of Traditional Frequentistic and Bayesian Approach
- I have been teaching elementary level statistics for educational science students for the last 12 years.
- My latest statistics course grades (Autumn 2006, n = 12) ranged from one to five as follows: grade 1: n = 3, grade 2: n = 2, grade 3: n = 4, grade 4: n = 2, grade 5: n = 1, showing that the frequency of the lowest grade (1) on the course is three (25.0 %).
- Previous data from the same course (2000-2005) shows that only five students out of 107 (4.7 %) had the lowest grade.
- Next, I will use the classical statistical approach (the likelihood principle) and Bayesian statistics to calculate whether the number of the lowest course grades is exceptionally high on my latest course when compared to my earlier stat courses.
31 C_Example 2: Comparison of Traditional Frequentistic and Bayesian Approach
- There are numerous possible reasons behind such development; for example, I have become more critical in my assessment, or the students are less motivated in learning quantitative techniques.
- However, I believe that the most important difference between the last and the preceding courses is that the assessment was based on a computer exercise with statistical computations.
- The preceding courses were assessed only with essay answers.
32 C_Example 2: Comparison of Traditional Frequentistic and Bayesian Approach
- I assume that the 12 students earned their grades independently (independent observations) of each other, as the computer exercise was conducted under my or my assistant's supervision.
- I further assume that the chance of getting the lowest grade (θ) is the same for each student.
- Therefore X, the number of lowest grades (1) on the scale from 1 to 5 among the 12 students in the latest stat course, has a binomial (12, θ) distribution: X ~ Bin(12, θ).
- For any integer r between 0 and 12, P(X = r | θ, 12) = (12 choose r) θ^r (1 − θ)^(12−r).
33 C_Example 2: Comparison of Traditional Frequentistic and Bayesian Approach
- The expected number of lowest grades is 12 × (5/107) ≈ 0.561.
- Theta is obtained by dividing the expected number of lowest grades by the number of students: θ = 0.561 / 12 ≈ 0.05.
- The null hypothesis is formulated as follows: H0: θ = 0.05, stating that the rate of the lowest grades on the current stat course is not a big thing and compares to the previous courses' rates.
- Three alternative hypotheses are formulated to address the concern of the increased number of lowest grades: H1: θ = 0.06, H2: θ = 0.07, H3: θ = 0.08.
34 C_Example 2: Comparison of Traditional Frequentistic and Bayesian Approach
- To compare the hypotheses, we calculate binomial distributions for each value of θ.
- For example, the null hypothesis (H0) calculation yields P_H0(3 | .05, 12) = (12 choose 3)(.05)^3(.95)^9 ≈ .017.
35 C_Example 2: Comparison of Traditional Frequentistic and Bayesian Approach
- The results for the alternative hypotheses are as follows:
- P_H1(3 | .06, 12) ≈ .027
- P_H2(3 | .07, 12) ≈ .039
- P_H3(3 | .08, 12) ≈ .053.
- The ratio of the hypotheses is roughly 1 : 2 : 2 : 3 and could be verbally interpreted with statements like "the second and third hypotheses explain the data about equally well", or "the fourth hypothesis explains the data about three times as well as the first hypothesis".
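These likelihoods can be reproduced directly (a sketch added to the notes, not from the lecture materials):

```python
from math import comb

def binom_pmf(r, n, theta):
    """P(X = r | theta, n) for X ~ Bin(n, theta)."""
    return comb(n, r) * theta**r * (1 - theta)**(n - r)

# Likelihood of the observed r = 3 lowest grades under each hypothesised rate;
# their relative sizes give the rough 1 : 2 : 2 : 3 ratio quoted on the slide.
liks = {th: binom_pmf(3, 12, th) for th in (0.05, 0.06, 0.07, 0.08)}
```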
36 C_Example 2: Comparison of Traditional Frequentistic and Bayesian Approach
- Lavine (1999) reminds us that P(r | θ, n), as a function of r (= 3) and θ = .05, .06, .07, .08, describes only how well each hypothesis explains the data; no value of r other than 3 is relevant.
- For example, P(4 | .05, 12) is irrelevant as it does not describe how well any hypothesis explains the data.
- This likelihood principle, that is, to base statistical inference only on the observed data and not on data that might have been observed, is an essential feature of the Bayesian approach.
37 C_Example 2: Comparison of Traditional Frequentistic and Bayesian Approach
- The Fisherian, so-called classical approach to testing the null hypothesis (H0: θ = .05) against the alternative hypothesis (H1: θ > .05) is to calculate the p-value that defines the probability under H0 of observing an outcome at least as extreme as the outcome actually observed.
38 C_Example 2: Comparison of Traditional Frequentistic and Bayesian Approach
- After the calculations, the p-value becomes ≈ 0.02 and would suggest H0 rejection, if the rejection level of significance is set at 5 per cent.
- Another problem with the p-value is that it violates the likelihood principle by using P(r | θ, n) for values of r other than the observed value of r = 3 (Lavine, 1999).
- The summands P(4 | .05, 12), P(5 | .05, 12), …, P(12 | .05, 12) do not describe how well any hypothesis explains the observed data.
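The p-value is the upper tail of the H0 binomial distribution; a short check (my sketch):

```python
from math import comb

def binom_pmf(r, n, theta):
    """P(X = r | theta, n) for X ~ Bin(n, theta)."""
    return comb(n, r) * theta**r * (1 - theta)**(n - r)

# Probability under H0 (theta = .05) of seeing 3 or more lowest grades among 12.
p_value = sum(binom_pmf(r, 12, 0.05) for r in range(3, 13))
```

Note that the sum runs over the unobserved counts 4 to 12 as well, which is exactly the likelihood-principle violation discussed above.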
39 C_Example 2: Comparison of Traditional Frequentistic and Bayesian Approach
- A Bayesian approach will continue from the same point as the classical approach, namely the probabilities given by the binomial distributions, but also make use of other relevant sources of a priori information.
- In this domain, it is plausible to think that the computerized test would make the number of total failures more probable than in the previous times when the evaluation was based solely on the essays.
- On the other hand, the computer test has only a 40 per cent weight in the equation that defines the final stat course grade: .3(Essay 1) + .3(Essay 2) + .4(Computer test) = Final grade.
40 C_Example 2: Comparison of Traditional Frequentistic and Bayesian Approach
- Another aspect is to consider the nature of the aforementioned tasks, as the essays are distance work assignments while the computer test is performed under observation.
- Perhaps the course grades of my earlier stat courses have a narrower dispersion due to violation of the independent observation assumption?
- For example, some students may have copy-pasted text from other sources or collaborated without permission.
- As we see, there are many sources of a priori information that I judge to be inconclusive; thus, I define that the null hypothesis is as likely to be true as false.
41 C_Example 2: Comparison of Traditional Frequentistic and Bayesian Approach
- This a priori judgment is expressed mathematically as P(H0) = 1/2 = P(H1) + P(H2) + P(H3). If I further assume that the alternative hypotheses H1, H2 and H3 share the same prior probability, then P(H1) = P(H2) = P(H3) = 1/6.
- These prior distributions summarize the knowledge about θ prior to incorporating the information from my course grades.
42 C_Example 2: Comparison of Traditional Frequentistic and Bayesian Approach
- An application of Bayes' theorem yields P(H0 | r = 3) = P(H0) P(3 | H0) / Σ_i P(Hi) P(3 | Hi) ≈ .30.
43 C_Example 2: Comparison of Traditional Frequentistic and Bayesian Approach
- Similar calculations for the alternative hypotheses yield P(H1 | r = 3) ≈ .16, P(H2 | r = 3) ≈ .29, P(H3 | r = 3) ≈ .31.
- These posterior distributions summarize the knowledge about θ after incorporating the grade information.
- The four hypotheses seem to be about equally likely (.30 vs. .16, .29, .31).
- The odds are about 2 to 1 (.30 vs. .70) that the latest stat course had a higher rate of lowest grades than 0.05.
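The posterior probabilities follow mechanically from the priors and the binomial likelihoods. The sketch below (my addition) reproduces P(H0 | r = 3) ≈ .30; the values for the alternatives may differ slightly from the slide's rounded figures:

```python
from math import comb

def binom_pmf(r, n, theta):
    """P(X = r | theta, n) for X ~ Bin(n, theta)."""
    return comb(n, r) * theta**r * (1 - theta)**(n - r)

# Priors: P(H0) = 1/2; the three alternatives share the remaining mass equally.
priors = {0.05: 1 / 2, 0.06: 1 / 6, 0.07: 1 / 6, 0.08: 1 / 6}
unnorm = {th: p * binom_pmf(3, 12, th) for th, p in priors.items()}
z = sum(unnorm.values())
post = {th: u / z for th, u in unnorm.items()}  # posterior P(Hi | r = 3)
```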
44 C_Example 2: Comparison of Traditional Frequentistic and Bayesian Approach
- The difference between classical and Bayesian statistics would be only philosophical (probability vs. inverse probability) if they always led to similar conclusions.
- However, in this case the p-value would suggest rejection of H0 (p ≈ .02), but the Bayesian analysis indicates not very strong evidence against θ = .05, only about 2 to 1.
45 C_Example 2: Comparison of Traditional Frequentistic and Bayesian Approach
- What if the number of the lowest grades is two?
- The classical approach would no longer suggest H0 rejection (p ≈ .12).
- The Bayesian result would stay quite the same (.39 vs. .17, .20, .24), saying that there is not much evidence against H0.
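Re-running both analyses with r = 2 (same machinery as before; my sketch) reproduces the contrast: the p-value moves a lot, the posterior much less:

```python
from math import comb

def binom_pmf(r, n, theta):
    """P(X = r | theta, n) for X ~ Bin(n, theta)."""
    return comb(n, r) * theta**r * (1 - theta)**(n - r)

# Classical side: upper-tail p-value under H0 with only 2 lowest grades observed.
p_value = sum(binom_pmf(r, 12, 0.05) for r in range(2, 13))

# Bayesian side: same priors as before, likelihood now evaluated at r = 2.
priors = {0.05: 1 / 2, 0.06: 1 / 6, 0.07: 1 / 6, 0.08: 1 / 6}
unnorm = {th: p * binom_pmf(2, 12, th) for th, p in priors.items()}
z = sum(unnorm.values())
post = {th: u / z for th, u in unnorm.items()}
```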
46 Outline
- Research Overview
- Introduction to Bayesian Modeling
- Investigating Non-linearities with Bayesian Networks
- Bayesian Classification Modeling
- Bayesian Dependency Modeling
- Bayesian Unsupervised Model-based Visualization
47 Investigating the Number of Non-linear and Multi-modal Relationships Between Observed Variables Measuring a Growth-oriented Atmosphere
A_Example 1
- Petri Nokelainen
- Pekka Ruohotie
- Research Centre for Vocational Education (RCVE)
- University of Tampere, Finland
- Tomi Silander
- Complex Systems Computation Group (CoSCo)
- Helsinki University of Technology, Finland
- Henry Tirri
- Nokia Research Center, Finland
See printed version!
(Nokelainen, Silander, Ruohotie & Tirri, 2007, 2003)
48 Introduction
A_Example 1
- From the social science researcher's point of view, the requirements of traditional frequentistic statistical analysis are very challenging.
- For example, the assumption of normality of both the phenomenon under investigation and the data is a prerequisite for traditional parametric frequentistic calculations.
49 Introduction
A_Example 1
- In situations where a latent construct cannot be appropriately represented as a continuous variable, or where ordinal or discrete indicators do not reflect underlying continuous variables, or where the latent variables cannot be assumed to be normally distributed, traditional Gaussian modeling is clearly not appropriate.
- In addition, normal distribution analysis sets minimum requirements for the number of observations and requires the measurement level of variables to be continuous.
50 Introduction
A_Example 1
- The Bayesian modeling approach is a good alternative to traditional frequentistic statistics, as it is capable of handling small discrete, non-normal samples as well as large-scale continuous data sets.
- The purpose of this paper is to investigate the number of non-linear and multi-modal relationships between variables in various real-world empirical Growth-oriented Atmosphere data in order to find out how much they weaken the robustness of linear statistical methods.
51 Research Questions
A_Example 1
- What kind of non-linearities, and how many, are captured by discrete Bayesian networks?
- Is there a difference between the results of linear bivariate correlations and Bayesian networks?
- Does an empirical sample containing pure linear dependencies have better overall fit indices in CFA than a sample containing less linear dependencies?
- Does an empirical sample containing pure linear dependencies have higher CFA parameter estimates than a sample containing less linear dependencies?
- Are discrete Bayesian networks a viable way to pre-model data before CFA?
52 Types of Non-linearities Studied
A_Example 1
- We study two different kinds of "non-linearities":
- non-linear relationships between continuous variables, and
- multi-modal relationships between continuous variables.
- Further, we only study simple non-linear relationships between two variables:
- The dependency between variables X and Y is considered non-linear if the mean of the conditional distribution of Y is not a monotonous (i.e., increasing or decreasing) function of X.
- Similarly, the dependency between variables X and Y is considered multi-modal if the mode of the conditional distribution of Y is not a monotonous function of X.
53 Bayesian Dependency Models
A_Example 1
- This study resembles to some extent the work by Hofmann and Tresp, in which they use the method of Parzen windows to allow non-linear dependencies between continuous variables.
- The emphasis in their work was to demonstrate the possibility of building Bayesian networks that can capture non-linear relationships.
- By using discretized variables this possibility comes trivially, but our objective is to find out to what extent this possibility is used, i.e., how many and what kind of non-linearities are captured by discrete Bayesian networks.
54 Bayesian Dependency Models
A_Example 1
- Given the identically and independently distributed multivariate data set D over variables V and the prior probability distribution ξ over Bayesian networks, Bayesian probability theory allows us to calculate the probability P(G | D, ξ) of any Bayesian network G.
- Different networks can then be compared by their probability.
- Finding the most probable Bayesian network for any given data is known to be NP-hard, which practically ruins the hopes for automatic discovery of the most probable network.
- However, stochastic search methods have proven to be successful in finding high-probability networks. Once the network G has been constructed using data D, we can use it to calculate predictive joint distributions P(V | G, D).
55 Bayesian Dependency Models
A_Example 1
- A Bayesian network structure can be used to effectively calculate conditional marginals of the predictive joint distribution for single variables, i.e., P(Vi | A, G, D), where A is any subset of the variables of V.
- In this paper we only study the marginals where A is a singleton {Vj} and there is either an arrow from Vi to Vj or an arrow from Vj to Vi (we say that Vi and Vj are adjacent in G).
56 Linear and Non-linear Dependencies
A_Example 1
- Frequentistic parametric statistical techniques are designed for normally distributed (both theoretically and empirically) indicators that have linear dependencies.
- Univariate normality
- Multivariate normality
- Bivariate linearity
57 Linear and Non-linear Dependencies
A_Example 1
r_P = -1.00
58 Linear and Non-linear Dependencies
A_Example 1
- Sometimes the univariate/multivariate normality assumption holds, but bivariate linearity is violated.
59 Linear and Non-linear Dependencies
A_Example 1
r_P = -.39
60 Linear and Non-linear Dependencies
A_Example 1
- In some cases, univariate normality is violated while the dependency is still linear.
61 Linear and Non-linear Dependencies
A_Example 1
r_P = .77
62 Linear and Non-linear Dependencies
A_Example 1
- In some cases, univariate normality is violated, resulting in non-linear dependency.
63 Linear and Non-linear Dependencies
A_Example 1
r_P = .59
64 Linear and Non-linear Dependencies
A_Example 1
r_P = .59
65 Measuring Non-linearities
A_Example 1
- To measure non-linear dependencies captured by Bayesian networks, we tested every variable in each network by conditioning it one by one with its immediate neighbors in the network.
- We then observed whether the modes and means of the conditional distributions were "linear" and whether the conditional distributions were "unimodal".
- Linearity of modes and means was tested by recording whether the means and modes were increasing or decreasing functions of the conditioning variable.
- Even clear departures from line-like behavior were accepted as linear as long as the direction of the correlation (positive, negative) did not change.
66 Measuring Non-linearities
A_Example 1
- In these experiments, "linear" means a relationship that can be more or less adequately modeled by a line describing how the central tendency of the dependent variable varies as a function of the independent variable.
- In measuring the unimodality of conditional distributions, we judged the dependency to be unimodal if (and only if) none of the conditional distributions P(Y | X) were clearly multimodal.
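The monotonicity and unimodality tests described above can be expressed as two small predicates. This is a hypothetical re-implementation of the idea, not the authors' code, and the example sequences are invented:

```python
def is_monotonic(values):
    """True if the sequence never changes direction (entirely non-increasing
    or entirely non-decreasing) -- the paper's notion of a 'linear' mean/mode."""
    pairs = list(zip(values, values[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)

def is_unimodal(probs):
    """True if the distribution rises to a single peak and then only falls."""
    diffs = [b - a for a, b in zip(probs, probs[1:])]
    seen_fall = False
    for d in diffs:
        if d < 0:
            seen_fall = True
        elif d > 0 and seen_fall:
            return False          # a second rise after falling -> multimodal
    return True

# Conditional means of Y over increasing values of X: a U-shape is non-linear.
u_shape = is_monotonic([4.1, 3.2, 2.8, 3.5, 4.4])        # False
# A two-peaked conditional distribution P(Y | X = x) is multimodal.
two_peaks = is_unimodal([0.30, 0.10, 0.25, 0.20, 0.15])  # False
```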
67 Results
A_Example 1
- What kind of non-linearities, and how many, are captured by discrete Bayesian networks?
- Investigation of two independent empirical data sets (n = 2430, n = 762) showed that only 39 per cent of all dependencies between variables were purely linear (linear mode, linear mean, unimodal).
- Nine per cent of dependencies were purely non-linear (non-linear mode, non-linear mean, multimodal).
- Multimodality was the most common reason for violation of linearity in both datasets.
68 Results
A_Example 1
- We continued the investigations with two distinct samples of the latter data, namely D21 (n = 447) and D23 (n = 208).
- These two data sets were selected for the following reasons:
- First, the sample sizes are closer to each other when compared to the D22 data (n = 71).
- Second, the two samples were collected with the same self-rated five-point Likert-scale questionnaire.
- Third, the D21 sample represents in this study "linear" empirical data with 23.9 per cent pure linear and 15.0 per cent pure non-linear dependencies, and the D23 sample represents "non-linear" data with only 16.2 per cent pure linear dependencies and 18.3 per cent pure non-linear dependencies.
69 Results
A_Example 1
- Our first goal was to compare subject domain interpretations of linear correlational analysis and non-linear Bayesian dependency models in order to investigate whether the models would differ in terms of interpretation according to the Growth-oriented Atmosphere model.
- Is there a difference between the results of linear bivariate correlations and Bayesian networks?
- The results showed that in general the Bayesian network models were congruent with the correlation matrices, as both methods found the same variables independent of all the other variables.
- However, with both samples non-linear modeling found a greater number of strong dependencies between growth-oriented atmosphere factors.
70 Results
A_Example 1
- Our second goal was to investigate the following four aspects of the growth-oriented atmosphere theory:
- support and rewards from the management,
- the incentive value of the job,
- operational capacity of the team, and
- work-related stress.
- Does an empirical sample containing pure linear dependencies have better overall fit indices in CFA than a sample containing less linear dependencies?
71 Results
A_Example 1
- First, when comparing CFA and Bayesian modeling, we learned that the latter is unable to find support for the second aspect under investigation, namely the relationship between the incentive value of the job, know-how developing and valuation of the job.
- Second, we found no major differences in results between the linear and non-linear samples.
- However, the linear data has higher parameter estimates in all four aspects under investigation.
72 Results
A_Example 1
- Next we investigated whether theoretically justifiable dependencies between factors found by the Bayesian models are also present in the CFA models.
- Does an empirical sample containing pure linear dependencies have higher CFA parameter estimates than a sample containing less linear dependencies?
- We conducted confirmatory factor analysis with the growth-oriented atmosphere model and examined the differences between the linear (D21) and non-linear (D23) factor covariance matrices.
- The results showed that the CFA model performed better with the linear sample.
73 Results
A_Example 1
- Is there a difference between substantive interpretations of the results of CFA and BDM with linear and non-linear samples?
- Neither Bayesian dependency model supported the second theoretical assumption about the relationship between Incentive value of the job (INV), Know-how developing (DEV) and Valuation of the job (VAL).
74 Results
A_Example 1
75 Results
A_Example 1
- Is there a difference between substantive interpretations of the results of CFA and BDM with linear and non-linear samples?
- The second observation is that the fourth theoretical assumption about the negative influence of Psychic stress (PSY) on all the other factors is only partially supported in both Bayesian models.
76 Results
A_Example 1
77 Results
A_Example 1
- Is there a difference between substantive interpretations of the results of CFA and BDM with linear and non-linear samples?
- Finally, the linear sample (D21) has in most cases higher CFA parameter estimates than the non-linear sample.
78 Conclusions
A_Example 1
- This study investigated the number of non-linear and multi-modal relationships between variables in various real-world empirical Growth-oriented Atmosphere data.
- Investigation of two independent empirical data sets (n = 2430 and n = 762) showed that only 39 per cent of all dependencies between variables were purely linear (linear mode, linear mean, unimodal).
- Nine per cent of dependencies were purely non-linear (non-linear mode, non-linear mean, multimodal).
- Multimodality was the most common reason for violation of linearity in both datasets.
79 Conclusions
A_Example 1
- Two subgroups of the latter data were identified as "linear" (D21, n = 447) and "non-linear" (D23, n = 208).
- Both correlational analysis and Bayesian dependency modeling were applied to these data in order to investigate relationships between the fourteen factors of the growth-oriented atmosphere model.
- Our conclusion, based on the preliminary analysis of two relatively small empirical samples, is that the descriptive power of traditional linear models (e.g., correlational analysis) is sufficient with non-linear data (pure linear dependencies vary between 16.2 and 23.9 per cent).
80 Outline
- Research Overview
- Introduction to Bayesian Modeling
- Investigating Non-linearities with Bayesian Networks
- Bayesian Classification Modeling
- Bayesian Dependency Modeling
- Bayesian Unsupervised Model-based Visualization
81 Bayesian Classification Modeling
- Which variables are the best predictors for different group memberships (e.g., A or C group, gender, productivity, level of giftedness)?
- In the classification process, the automatic search looks for the best set of variables to predict the class variable for each data item.
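As a hedged illustration of the underlying idea (the actual B-Course search over predictor sets and discretizations is far richer), a minimal discrete naive Bayes classifier with Laplace smoothing might look like this; the data, the assumed 5-point answer scale, and all function names are invented:

```python
from collections import Counter, defaultdict

def train(rows, labels, alpha=1.0):
    """Count class frequencies and per-class value counts for each feature."""
    classes = Counter(labels)
    counts = defaultdict(Counter)          # (feature index, class) -> value counts
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            counts[(i, y)][v] += 1
    return classes, counts, alpha

def predict(model, row):
    """Pick the class maximizing prior x product of smoothed value probabilities."""
    classes, counts, alpha = model
    n = sum(classes.values())
    best, best_score = None, float("-inf")
    for y, ny in classes.items():
        score = ny / n
        for i, v in enumerate(row):
            c = counts[(i, y)]
            score *= (c[v] + alpha) / (ny + alpha * 5)  # 5 = assumed answer scale
        if score > best_score:
            best, best_score = y, score
    return best

# Tiny invented data: two groups, two Likert-type predictor items per student.
rows = [(5, 4), (4, 5), (1, 2), (2, 1)]
labels = ["A", "A", "B", "B"]
model = train(rows, labels)
```

The point of the sketch is only the model form (class variable at the root, predictors conditionally independent given the class); the search for *which* predictors to include is what the genetic-algorithm machinery described next is for.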
82 Bayesian Classification Modeling
- The search procedure resembles traditional linear discriminant analysis (LDA; Huberty, 1994, pp. 118-126), but the implementation is totally different.
- For example, the variable selection problem that is addressed with a forward, backward or stepwise selection procedure in LDA is replaced with a genetic algorithm approach (e.g., Hilario, Kalousis, Prados & Binz, 2004; Hsu, 2004) in Bayesian classification modeling.
83 Bayesian Classification Modeling
- The genetic algorithm approach means that variable selection is not limited to one (or two or three) specific approaches; instead, many approaches and their combinations are exploited.
- One possible approach is to begin with the presumption that models (i.e., possible predictor variable combinations) that resemble each other a lot (i.e., have almost the same variables and discretizations) are likely to be almost equally good.
- This leads to a search strategy in which models that resemble the current best model are selected for comparison, instead of picking models randomly.
84 Bayesian Classification Modeling
- Another approach is to abandon the habit of always rejecting the weakest model and instead collect a set of relatively good models.
- The next step is to combine the best parts of these models so that the resulting combined model is better than any of the original models.
- B-Course is capable of mobilizing many more viable approaches, for example, sometimes rejecting the better model (algorithms like hill climbing, simulated annealing) or trying to avoid picking a similar model twice (tabu search).
85 C_Example 3: Motivational Predictors for Study Group Membership
- The researcher picked from a student population (N = 240) a subsample (one study group, n = 23) for closer investigation.
- His goal was to study how students' self-reported learning motivation was related to the academic success of group works.
- The subsample consisted of five groups that the students had formed by themselves in the early stages of their studies.
86 C_Example 3: Motivational Predictors for Study Group Membership
- All the participants (n = 23) filled in the Abilities for Professional Learning Questionnaire (Ruohotie, 2002; Nokelainen & Ruohotie, 2002) that measures their motivational level.
- See research article 3 for item descriptions.
- The researcher interviewed the participants to profile the groups.
87 C_Example 3: Motivational Predictors for Study Group Membership
- The classification analysis was conducted using group membership as the class variable.
- The aim of the analysis was to learn which of the APLQ items (i.e., learning motivation dimensions) would best predict differences between the groups.
- The naïve Bayes network that is produced in the analysis is used to examine special features of the groups.
- In addition, it allows groupwise comparison.
88 C_Example 3: Motivational Predictors for Study Group Membership
Sample size: n = 23
Classification accuracy: 60.87 %.
Common components: V2, V3, V4, V6, V7, V8, V10, V11, V12, V13, V14, V16, V17, V20, V21, V22, V23, V25, V26, V28
89 C_Example 3: Motivational Predictors for Study Group Membership
90C_Example 4 Mobile Learning Components
Predicting the Use of Three Types of Computers
- The study investigated how the components of mobile learning predict the use of different computer devices (Syvänen, Nokelainen, Ahonen & Turunen, 2003).
- The sample (n = 87) consisted of 5th and 6th grade Finnish elementary school students.
91C_Example 4 Mobile Learning Components
Predicting the Use of Three Types of Computers
- The classification variable was the device, which has three values:
- 1 = Handheld computer
- 2 = Portable computer
- 3 = Desktop computer
- The students were asked fourteen questions to measure their mobile learning experiences.
92C_Example 4 Mobile Learning Components
Predicting the Use of Three Types of Computers
- Variable Description
- DEEP Deep approach
- HELPSEE Help-seeking
- MANAGEM Learning management
- CREATIV Creativity in problem solving
- EFFECTI Perceived effectiveness
- SELFEFF Self-efficacy
- SEARCH Knowledge seeking
- SHARE Knowledge sharing
- DUALISM Conception of knowledge
- SURFACE Surface approach
- CONFIDENCE Computer confidence
- PEERLEARN Peer learning
- EASINESS Perceived easiness of use
- CONSTRUC Knowledge construction
93C_Example 4 Mobile Learning Components
Predicting the Use of Three Types of Computers
Sample size: n = 87
Classification accuracy: 62.32 %.
Common components: DUALISM, SURFACE, CONFIDENCE, PEERLEARN, EASINESS, CONSTRUC
94C_Example 4 Mobile Learning Components
Predicting the Use of Three Types of Computers
95Investigating the Influence of Attribution Styles
on the Development of Mathematical Talent
A_Example 2
- Petri Nokelainen
- Research Centre for Vocational Education
- University of Tampere, Finland
- Kirsi Tirri
- Department of Practical Theology
- University of Helsinki, Finland
- Hanna-Leena Merenti-Välimäki
- Espoo-Vantaa Institute of Technology, Finland
See printed version!
(Nokelainen, Tirri & Merenti-Välimäki, 2007.)
96Outline
- Research Overview
- Introduction to Bayesian Modeling
- Investigating Non-linearities with Bayesian
Networks - Bayesian Classification Modeling
- Bayesian Dependency Modeling
- Bayesian Unsupervised Model-based Visualization
97Bayesian Dependency Modeling
- Bayesian dependency modeling (BDM) is applied to examine dependencies between variables through both their visual representation and the probability ratio of each dependency.
- The graphical visualization of a Bayesian network contains two components:
- 1) Observed variables, visualized as ellipses (nodes).
- 2) Dependencies, visualized as lines (arcs) between the nodes.
98C_Example 5 Calculation of Bayesian Score
- Next, I will present how the Bayesian score (BS), that is, the probability of the model P(M|D), is first calculated and then compared for the two models presented in the figure.
Figure 9. An Example of Two Competing Bayesian Network Structures
(Nokelainen, 2008, p. 121)
99C_Example 5 Calculation of Bayesian Score
- Let us assume that we have the following data:
x1 x2
1 1
1 1
2 2
1 2
1 1
- Model 1 (M1) represents the two variables, x1 and x2, without a statistical dependency, and Model 2 (M2) represents the two variables with a dependency (i.e., with a connecting arc).
- The binomial data might be the result of an experiment where the five participants have solved a job-related task before (x1) and after (x2) a vocational training period.
100C_Example 5 Calculation of Bayesian Score
- In order to calculate P(M1,2|D), we need to solve P(D|M1,2) for the two models M1 and M2.
- The probability of the data given the model is solved by using the following marginal likelihood equation (Congdon, 2001, p. 473; Myllymäki, Silander, Tirri & Uronen, 2002; Myllymäki & Tirri, 1998, p. 63):
101C_Example 5 Calculation of Bayesian Score
- In Equation 4, the following symbols are used:
- n is the number of variables (i indexes the variables from 1 to n)
- ri is the number of values of the ith variable (k indexes these values from 1 to ri)
- qi is the number of possible configurations of the parents of the ith variable (j indexes these configurations from 1 to qi)
- Nij describes the number of rows in the data that have the jth configuration for the parents of the ith variable
- Nijk describes how many rows in the data that have the kth value for the ith variable also have the jth configuration for the parents of the ith variable
- N' is the equivalent sample size, set to be the average number of values divided by two.
- The marginal likelihood equation produces a Bayesian Dirichlet score that allows model comparison (Heckerman et al., 1995; Tirri, 1997; Neapolitan & Morris, 2004).
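The equation image itself is not transcribed; a reconstruction consistent with the symbol list above is the standard Bayesian Dirichlet (BDeu) marginal likelihood:

```latex
P(D \mid M) \;=\; \prod_{i=1}^{n} \prod_{j=1}^{q_i}
\frac{\Gamma\!\left(\tfrac{N'}{q_i}\right)}
     {\Gamma\!\left(\tfrac{N'}{q_i} + N_{ij}\right)}
\prod_{k=1}^{r_i}
\frac{\Gamma\!\left(\tfrac{N'}{r_i q_i} + N_{ijk}\right)}
     {\Gamma\!\left(\tfrac{N'}{r_i q_i}\right)}
```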
102C_Example 5 Calculation of Bayesian Score
- First, I will calculate P(D|M1) given the values of variable x1:
(Equation not transcribed.)
x1 x2
1 1
1 1
2 2
1 2
1 1
103C_Example 5 Calculation of Bayesian Score
- Second, the values for x2 are calculated:
104C_Example 5 Calculation of Bayesian Score
- The BS, the probability of the first model P(M1|D), is 0.027 × 0.012 ≈ 0.000324.
105C_Example 5 Calculation of Bayesian Score
- Third, P(D|M2) is calculated given the values of variable x1:
106C_Example 5 Calculation of Bayesian Score
- Fourth, the values for the first parent configuration (x1 = 1) are calculated:
107C_Example 5 Calculation of Bayesian Score
- Fifth, the values for the second parent configuration (x1 = 2) are calculated:
108C_Example 5 Calculation of Bayesian Score
- The BS, the probability of the second model P(M2|D), is 0.027 × 0.027 × 0.500 ≈ 0.000365.
109C_Example 5 Calculation of Bayesian Score
- Bayes' theorem enables the calculation of the ratio of the two models, M1 and M2.
- As both models share the same a priori probability, P(M1) = P(M2), the prior probabilities cancel out.
- The probability of the data, P(D), also cancels out, as it appears in the same position in both formulas:
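Written out with the numbers from the preceding slides, and assuming equal priors:

```latex
\frac{P(M_1 \mid D)}{P(M_2 \mid D)}
= \frac{P(D \mid M_1)\,P(M_1)\,/\,P(D)}{P(D \mid M_2)\,P(M_2)\,/\,P(D)}
= \frac{P(D \mid M_1)}{P(D \mid M_2)}
\approx \frac{0.000324}{0.000365} \approx 0.89
```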
110C_Example 5 Calculation of Bayesian Score
- The result of the model comparison shows that, since the ratio is less than 1, M2 is more probable than M1.
- This result becomes explicit when we investigate the sample data more closely.
- Even a sample this small (n = 5) shows that there is a clear dependency between the values of x1 and x2 (four out of five value pairs are identical).
x1 x2
1 1
1 1
2 2
1 2
1 1
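The whole worked example can be checked numerically. The sketch below is mine, not B-Course code; it assumes the equivalent sample size N' = 1 implied by the symbol list (the average number of values, 2, divided by two) and that both variables are binary.

```python
from math import lgamma, exp

# The five observations (x1, x2) from the example data.
data = [(1, 1), (1, 1), (2, 2), (1, 2), (1, 1)]
ESS = 1.0  # equivalent sample size N': average number of values (2) / 2

def count(values):
    d = {}
    for v in values:
        d[v] = d.get(v, 0) + 1
    return d

def family_score(child_counts):
    """Bayesian Dirichlet score for one variable; child_counts holds one
    value-count dict per parent configuration (cf. Heckerman et al., 1995)."""
    q = len(child_counts)  # number of parent configurations q_i
    r = 2                  # both variables are binary (r_i = 2)
    log_p = 0.0
    for counts in child_counts:
        n_ij = sum(counts.values())
        log_p += lgamma(ESS / q) - lgamma(ESS / q + n_ij)
        for k in (1, 2):
            n_ijk = counts.get(k, 0)
            log_p += lgamma(ESS / (r * q) + n_ijk) - lgamma(ESS / (r * q))
    return exp(log_p)

x1 = [a for a, _ in data]
x2 = [b for _, b in data]

# M1: no arc, each variable scored under a single (empty) parent configuration.
p_d_m1 = family_score([count(x1)]) * family_score([count(x2)])

# M2: x1 -> x2, so x2 gets one parent configuration per value of x1.
p_d_m2 = family_score([count(x1)]) * family_score(
    [count([b for a, b in data if a == 1]),
     count([b for a, b in data if a == 2])])

print(p_d_m1, p_d_m2, p_d_m1 / p_d_m2)  # ratio < 1, so M2 is more probable
```

With exact arithmetic the scores are 21/65536 ≈ 0.000320 and 21/57344 ≈ 0.000366, giving a ratio of exactly 0.875; the slide's 0.000324 and 0.000365 come from multiplying the rounded intermediate factors 0.027, 0.012, 0.027 and 0.500.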
111C_Example 6 Modeling of Prerequisites for
Organizational Learning
- The staff of five Finnish police organizations (n = 281) filled out the Growth Oriented Atmosphere Questionnaire (Luoma, Nokelainen & Ruohotie, 2002).
- BDM was applied to study how the theoretical model is represented in the Bayesian network.
112C_Example 6 Modeling of Prerequisites for
Organizational Learning
113C_Example 6 Modeling of Prerequisites for
Organizational Learning
114Investigating Subordinates' Evaluations of their Superiors' Emotional Leadership
A_Example 3
- Petri Nokelainen
- Pekka Ruohotie
- Research Centre for Vocational Education
- University of Tampere, Finland
See printed version!
(Nokelainen & Ruohotie, 2005.)
115Conceptual Modeling of Self-rated
Intelligence-profile
A_Example 4
- Kirsi Tirri and Erkki Komulainen
- University of Helsinki, Finland
- Petri Nokelainen and Henry Tirri
- Helsinki University of Technology, Finland
See printed version!
(Tirri, K., Komulainen, Nokelainen & Tirri, H., 2002.)
116Outline
- Research Overview
- Introduction to Bayesian Modeling
- Investigating Non-linearities with Bayesian
Networks - Bayesian Classification Modeling
- Bayesian Dependency Modeling
- Bayesian Unsupervised Model-based Visualization
117Bayesian Unsupervised Model-based Visualization
- Dispersion of single data vectors in three-dimensional space in order to find out how the factors are interrelated at the individual level.
- The data is mapped into a different set of dimensions according to the optimized solution, from which the Bayesian algorithm produces one optimal model.
- The three-dimensional model is plotted as a series of two-dimensional figures, each presenting one dimension at a time.
- BayMiner: http://www.bayminer.com
118(No Transcript)
119Investigating Growth Prerequisites in a Finnish
Polytechnic Institute of Higher Education
A_Example 5
- Petri Nokelainen
- Pekka Ruohotie
- Research Centre for Vocational Education
- University of Tampere, Finland
See printed version!
(Nokelainen & Ruohotie, in press, to appear in the Journal of Workplace Learning.)
120Links
- Research Centre for Vocational Education <URL: http://www.uta.fi/aktkk>
- Complex Systems Computation Group <URL: http://cosco.hiit.fi>
- EDUTECH <URL: http://cosco.hiit.fi/edutech>
- B-COURSE <URL: http://b-course.hiit.fi>
- BayMiner <URL: http://www.bayminer.com>
121References
- Abelson, R. P. (1995). Statistics as Principled Argument. Hillsdale, NJ: Lawrence Erlbaum Associates.
- Anderson, J. (1995). Cognitive Psychology and Its Implications. New York: Freeman.
- Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society, 53, 370-418.
- Bernardo, J., & Smith, A. (2000). Bayesian Theory. New York: Wiley.
- Brannen, J. (2004). Working qualitatively and quantitatively. In C. Seale, G. Gobo, J. Gubrium, & D. Silverman (Eds.), Qualitative Research Practice (pp. 312-326). London: Sage.
- Fisher, R. (1935). The Design of Experiments. Edinburgh: Oliver & Boyd.
122References
- Gigerenzer, G. (2000). Adaptive Thinking. New York: Oxford University Press.
- Gill, J. (2002). Bayesian Methods: A Social and Behavioral Sciences Approach. Boca Raton: Chapman & Hall/CRC.
- Gigerenzer, G., Krauss, S., & Vitouch, O. (2004). The null ritual: What you always wanted to know about significance testing but were afraid to ask. In D. Kaplan (Ed.), The SAGE Handbook of Quantitative Methodology for the Social Sciences (pp. 391-408). Thousand Oaks: Sage.
- Gobo, G. (2004). Sampling, representativeness and generalizability. In C. Seale, J. F. Gubrium, G. Gobo, & D. Silverman (Eds.), Qualitative Research Practice (pp. 435-456). London: Sage.
- Hair, J. F., Anderson, R. E., Tatham, R. L., & Black, W. C. (1998). Multivariate Data Analysis. Fifth edition. Englewood Cliffs, NJ: Prentice Hall.
123References
- Heckerman, D., Geiger, D., & Chickering, D. (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20(3), 197-243.
- Lavine, M. L. (1999). What is Bayesian Statistics and Why Everything Else is Wrong. The Journal of Undergraduate Mathematics and Its Applications, 20, 165-174.
- Lindley, D. V. (1971). Making Decisions. London: Wiley.
- Lindley, D. V. (2001). Harold Jeffreys. In C. C. Heyde & E. Seneta (Eds.), Statisticians of the Centuries (pp. 402-405). New York: Springer.
- Luoma, M., Nokelainen, P., & Ruohotie, P. (2003, April). Learning Strategies for Police Organization: Modeling Organizational Learning Prerequisites. Paper presented at the Annual Meeting of the American Educational Research Association (AERA 2002). New Orleans, USA.
124References
- Myllymäki, P., Silander, T., Tirri, H., & Uronen, P. (2002). B-Course: A Web-Based Tool for Bayesian and Causal Data Analysis. International Journal on Artificial Intelligence Tools, 11(3), 369-387.
- Myllymäki, P., & Tirri, H. (1998). Bayes-verkkojen mahdollisuudet [Possibilities of Bayesian Networks]. Teknologiakatsaus 58/98. Helsinki: TEKES.
125References
- Nokelainen, P. (2008). Modeling of Professional Growth and Learning: Bayesian Approach. Tampere: Tampere University Press.
- Nokelainen, P., & Ruohotie, P. (2005). Investigating the Construct Validity of the Leadership Competence and Characteristics Scale. In Proceedings of the International Research on Work and Learning 2005 Conference, Sydney, Australia.
- Nokelainen, P., & Ruohotie, P. (In press). Investigating Growth Prerequisites in a Finnish Polytechnic for Higher Education. To appear in the Journal of Workplace Learning.
126References
- Nokelainen, P., Silander, T., Ruohotie, P., & Tirri, H. (2003, August). Investigating Non-linearities with Bayesian Networks. Paper presented at the 111th Annual Convention of the American Psychological Association, Division of Evaluation, Measurement and Statistics. Toronto, Canada.
- Nokelainen, P., Silander, T., Ruohotie, P., & Tirri, H. (2007). Investigating the Number of Non-linear and Multi-modal Relationships Between Observed Variables Measuring a Growth-oriented Atmosphere. Quality & Quantity, 41(6), 869-890.
127References
- Nokelainen, P., Tirri, K., & Merenti-Välimäki, H.-L. (2007). Investigating the Influence of Attribution Styles on the Development of Mathematical Talent. Gifted Child Quarterly, 51(1), 64-81.
- Syvänen, A., Nokelainen, P., Ahonen, M., & Turunen, H. (2003, August). Approaches to Assessing Mobile Learning Components. Paper presented at the 10th Biennial Conference of the European Association for Research on Learning and Instruction. Padova, Italy.
- Thompson, B. (1994). Guidelines for authors. Educational and Psychological Measurement, 54(4), 837-847.
128References
- Tirri, K., Komulainen, E., Nokelainen, P., & Tirri, H. (2002). Conceptual Modeling of Self-Rated Intelligence-Profile. In Proceedings of the 2nd International Self-Concept Research Conference. University of Western Sydney, Self Research Center.
- de Vaus, D. A. (2004). Research Design in Social Research. Third edition. London: Sage.