Research Centre for Vocational Education - PowerPoint Presentation Transcript


1
Research Centre for Vocational Education
Bayesian Modeling in Educational Research
Petri Nokelainen (petri.nokelainen@uta.fi)
v2.2. This lecture is available at
http://www.uta.fi/aktkk/lectures/bayes_en
2
Outline
  • Research Overview
  • Introduction to Bayesian Modeling
  • Investigating Non-linearities with Bayesian
    Networks
  • Bayesian Classification Modeling
  • Bayesian Dependency Modeling
  • Bayesian Unsupervised Model-based Visualization

3
Research Overview
VOCATIONAL EDUCATION
COMPUTER SCIENCE
Professional growth and learning
Applied research
Pekka Ruohotie Petri Nokelainen
Online pedagogy
Authentic use of educational technology
4
Research Overview
RCVE Data Server - On-line Questionnaire
BayMiner - Bayesian Supervised and Unsupervised Visualization
(Timeline 1997-2008)
5
Research Overview
1. BAYESIAN MODELING
-> Linear vs. non-linear dependencies
-> Dependency and classification modeling (B-COURSE)
-> Unsupervised model-based visualisation (BAYMINER)
6
Linear and Non-linear Dependencies
7
Bayesian Classification Modeling
The classification accuracy of the best model
found is 83.48% (58.57%).
Common factors: PUB_T, CC_PR, CC_HE, PA, C_SHO, C_FAIL, CC_AB, CC_ES
8
Bayesian Dependency Modeling
9
Bayesian Unsupervised Model-based Visualization
10
Current Research
  • 1. Bayesian track
  • Normalized Maximum Likelihood (NML) test
  • Bayesian dependency modeling with latent
    variables
  • Visualization of Bayesian networks (Banex)
  • 2. Educational technology track
  • Proactive search to help learners to manage
    information online
  • Empirical validation of the technical and
    pedagogical usability criteria of digital
    learning material

11
Outline
  • Research Overview
  • Introduction to Bayesian Modeling
  • Investigating Non-linearities with Bayesian
    Networks
  • Bayesian Classification modeling
  • Bayesian Dependency modeling
  • Bayesian Unsupervised Model-based Visualization

12
Putting Bayesian Techniques on the Map
13
Introduction to Bayesian Modeling
  • Probability is a mathematical construct that
    behaves in accordance with certain rules and can
    be used to represent uncertainty.
  • Bayesian inference uses conditional probabilities
    to represent uncertainty.
  • P(M | D, I) - the probability of unknown things (M) given the data (D) and background information (I).

14
Introduction to Bayesian Modeling
  • The essence of Bayesian inference is in the rule, known as Bayes' theorem, that tells us how to update our initial probabilities P(M | I) if we see data D, in order to find out P(M | D, I).
  • A priori probability
  • Conditional probability
  • Posteriori probability

P(M|D) = P(D|M) P(M) / Σ_M P(D|M) P(M)
15
Introduction to Bayesian Modeling
  • Bayesian inference comprises the following three
    principal steps
  • (1) Obtain the initial probabilities P(M | I) for the unknown things. (Prior distribution.)
  • (2) Calculate the probabilities of the data D given different values for the unknown things, i.e., P(D | M, I). (Likelihood.)
  • (3) Finally, the probability distribution of interest, P(M | D, I), is calculated using Bayes' theorem. (Posterior distribution.)
  • Bayes' theorem can be used sequentially.

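As a sketch of these three steps, the fragment below (with hypothetical prior and likelihood values chosen only for illustration) computes a posterior over two candidate models and then reuses that posterior as the prior for a second update, illustrating the sequential use of Bayes' theorem:

```python
# A minimal sketch of the three steps of Bayesian inference for a
# discrete unknown M. The prior and likelihood values are hypothetical.

def posterior(prior, likelihood):
    """Combine P(M|I) and P(D|M,I) into P(M|D,I) via Bayes' theorem."""
    # Unnormalized posterior: P(D|M,I) * P(M|I) for each value of M.
    unnorm = {m: likelihood[m] * prior[m] for m in prior}
    evidence = sum(unnorm.values())  # P(D|I), the normalizer
    return {m: p / evidence for m, p in unnorm.items()}

# Step 1: prior distribution over two candidate models.
prior = {"M1": 0.5, "M2": 0.5}
# Step 2: likelihood of the observed data under each model.
likelihood = {"M1": 0.8, "M2": 0.2}
# Step 3: posterior distribution.
post = posterior(prior, likelihood)
print(post)  # {'M1': 0.8, 'M2': 0.2}

# Sequential use: the posterior becomes the prior for the next datum.
post2 = posterior(post, likelihood)
```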
16
Introduction to Bayesian Modeling
  • The Bayesian method
  • (1) is parameter-free: no user input is required; instead, the prior distributions of the model offer a theoretically justifiable method for affecting the model construction,
  • (2) works with probabilities and can hence be expected to produce robust results with discrete data containing nominal and ordinal attributes,
  • (3) has no limit for minimum sample size,
  • (4) is able to analyze both linear and non-linear dependencies,
  • (5) assumes no multivariate normal model.

17
C_Example 1: Applying Bayes' Theorem
  • Company A employs workers in well-paid, short-term jobs.
  • The job sets certain prerequisites for the applicants' linguistic abilities and looks.
  • Earlier all the applicants were interviewed, but nowadays this has become an impossible task, as the numbers of open vacancies and applicants have increased enormously.
  • The personnel department of the company was ordered to develop a questionnaire to preselect the most suitable applicants for the interview.

18
C_Example 1: Applying Bayes' Theorem
  • The psychometrician who developed the instrument estimates that it would work out right for 90 out of 100 applicants, if they are honest.
  • We know on the basis of earlier interviews that the terms (linguistic abilities, looks) are valid for one person per 100 in the target population.
  • The question is: if an applicant gets enough points to participate in the interview, is he or she hired for the job (after an interview)?

19
C_Example 1: Applying Bayes' Theorem
  • The a priori probability P(H) is described by the number of people in the target population who really are able to meet the requirements of the task (1 out of 100 = .01).
  • The counter-assumption of the a priori is P(¬H), which equals 1 - P(H), thus .99.
  • The psychometrician's belief about how the instrument works is called the conditional probability, P(E|H) = .9.
  • The instrument's failure to indicate non-valid applicants, i.e., those who are not able to succeed in the following interview, is stated as P(E|¬H), which equals .1.
  • These values need not sum to one!

20
C_Example 1: Applying Bayes' Theorem
  • A priori probability
  • Conditional probability
  • Posterior probability

P(H|E) = P(E|H) P(H) / [ P(E|H) P(H) + P(E|¬H) P(¬H) ]
       = (.9)(.01) / [ (.9)(.01) + (.1)(.99) ]
       ≈ .08
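The slide's result can be verified with a few lines of Python, using the probabilities given above:

```python
# Bayes' theorem applied to the hiring example:
# P(H) = .01 (a priori), P(E|H) = .9, P(E|not-H) = .1.
p_h = 0.01
p_e_h = 0.90
p_e_not_h = 0.10

p_h_e = (p_e_h * p_h) / (p_e_h * p_h + p_e_not_h * (1 - p_h))
print(round(p_h_e, 3))  # 0.083, i.e. the slide's approximately .08
```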
21
C_Example 1: Applying Bayes' Theorem
22
C_Example 1: Applying Bayes' Theorem
  • What if the measurement error of the psychometrician's instrument had been 20 per cent?
  • P(E|H) = 0.8; P(E|¬H) = 0.2

23
C_Example 1: Applying Bayes' Theorem
24
C_Example 1: Applying Bayes' Theorem
  • What if the measurement error of the psychometrician's instrument had been only one per cent?
  • P(E|H) = 0.99; P(E|¬H) = 0.01

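Both what-if scenarios (20 per cent and 1 per cent measurement error) use the same formula; this sketch wraps it in a helper function (`p_hire` is a name introduced here only for illustration):

```python
# Posterior P(H|E) for the hiring example with varying measurement error.
def p_hire(p_e_h, p_e_not_h, p_h=0.01):
    """Bayes' theorem with prior P(H) = .01 as on the slides."""
    return (p_e_h * p_h) / (p_e_h * p_h + p_e_not_h * (1 - p_h))

print(round(p_hire(0.80, 0.20), 3))  # 0.039 - a worse instrument, weaker evidence
print(round(p_hire(0.99, 0.01), 3))  # 0.5   - even a 1% error leaves a coin flip
```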
25
C_Example 1: Applying Bayes' Theorem
26
C_Example 1: Applying Bayes' Theorem
  • Quite often people tend to estimate probabilities as too high or too low, as they are not able to update their beliefs even in simple decision-making tasks when situations change dynamically (Anderson, 1995, p. 326).

27
C_Example 2 Comparison of Traditional
Frequentistic and Bayesian Approach
  • One of the most important rules educational science journals apply to judge the scientific merits of any submitted manuscript is that all the reported results should be based on the so-called null hypothesis significance testing procedure (NHSTP) and its featured product, the p-value.
  • Gigerenzer, Krauss and Vitouch (2004, p. 392) describe the null ritual as follows:
  • 1) Set up a statistical null hypothesis of no mean difference or zero correlation. Don't specify the predictions of your research or of any alternative substantive hypotheses.
  • 2) Use 5 per cent as a convention for rejecting the null. If significant, accept your research hypothesis.
  • 3) Always perform this procedure.

28
C_Example 2 Comparison of Traditional
Frequentistic and Bayesian Approach
  • A p-value is the probability of the observed data (or of more extreme data points), given that the null hypothesis H0 is true, P(D|H0) (id.).
  • The first common misunderstanding is that the p-value of, say, a t-test would describe how probable it is to get the same result if the study is repeated many times (Thompson, 1994).
  • Gerd Gigerenzer and his colleagues (id., p. 393) call this the replication fallacy, as P(D|H0) is confused with 1 - P(D).

29
C_Example 2 Comparison of Traditional
Frequentistic and Bayesian Approach
  • The second misunderstanding, shared by both applied statistics teachers and students, is that the p-value would prove or disprove H0. However, a significance test can only provide probabilities, not prove or disprove the null hypothesis.
  • Gigerenzer (id., p. 393) calls this fallacy an illusion of certainty: despite wishful thinking, P(D|H0) is not the same as P(H0|D), and a significance test does not and cannot provide a probability for a hypothesis.
  • Bayesian statistics provides a way of calculating the probability of a hypothesis (discussed later in this section).

30
C_Example 2 Comparison of Traditional
Frequentistic and Bayesian Approach
  • I have been teaching elementary level statistics
    for educational science students for the last 12
    years.
  • My latest statistics course grades (Autumn 2006, n = 12) ranged from one to five as follows: 1) n = 3; 2) n = 2; 3) n = 4; 4) n = 2; 5) n = 1, showing that the frequency of the lowest grade (1) from the course is three (25.0%).
  • Previous data from the same course (2000-2005) shows that only five students out of 107 (4.7%) had the lowest grade.
  • Next, I will use the classical statistical approach (the likelihood principle) and Bayesian statistics to calculate whether the number of the lowest course grades is exceptionally high in my latest course compared to my earlier stat courses.

31
C_Example 2 Comparison of Traditional
Frequentistic and Bayesian Approach
  • There are numerous possible reasons behind such a development; for example, I have become more critical in my assessment, or the students are less motivated to learn quantitative techniques.
  • However, I believe that the most important
    difference between the last and preceding courses
    is that the assessment was based on a computer
    exercise with statistical computations.
  • The preceding courses were assessed only with
    essay answers.

32
C_Example 2 Comparison of Traditional
Frequentistic and Bayesian Approach
  • I assume that the 12 students earned their grades independently (independent observations) of each other, as the computer exercise was conducted under my or my assistant's supervision.
  • I further assume that the chance of getting the lowest grade (θ) is the same for each student.
  • Therefore X, the number of lowest grades (1) on the scale from 1 to 5 among the 12 students in the latest stat course, has a binomial (12, θ) distribution: X ~ Bin(12, θ).
  • For any integer r between 0 and 12,

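The binomial probability in question, P(X = r) = C(12, r) θ^r (1 − θ)^(12 − r), can be evaluated with Python's standard library (a sketch):

```python
from math import comb

def binom_pmf(r, n, theta):
    """P(X = r) for X ~ Bin(n, theta)."""
    return comb(n, r) * theta**r * (1 - theta)**(n - r)

# Probability of exactly 3 lowest grades among 12 students under theta = .05.
print(round(binom_pmf(3, 12, 0.05), 3))  # 0.017
```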
33
C_Example 2 Comparison of Traditional
Frequentistic and Bayesian Approach
  • The expected number of lowest grades is 12 × (5/107) ≈ 0.561.
  • Theta is obtained by dividing the expected number of lowest grades by the number of students: 0.561 / 12 ≈ 0.05.
  • The null hypothesis is formulated as follows: H0: θ = 0.05, stating that the rate of the lowest grades from the current stat course is not a big thing and is comparable to the previous courses' rates.
  • Three alternative hypotheses are formulated to address the concern about the increased number of lowest grades: H1: θ = 0.06; H2: θ = 0.07; H3: θ = 0.08.

34
C_Example 2 Comparison of Traditional
Frequentistic and Bayesian Approach
  • To compare the hypotheses, we calculate binomial distributions for each value of θ.
  • For example, the null hypothesis (H0) calculation yields P_H0(3 | .05, 12) ≈ .017.

35
C_Example 2 Comparison of Traditional
Frequentistic and Bayesian Approach
  • The results for the alternative hypotheses are as follows:
  • P_H1(3 | .06, 12) ≈ .027
  • P_H2(3 | .07, 12) ≈ .039
  • P_H3(3 | .08, 12) ≈ .053.
  • The ratio of the hypotheses is roughly 1 : 2 : 2 : 3 and could be verbally interpreted with statements like "the second and third hypotheses explain the data about equally well" or "the fourth hypothesis explains the data about three times as well as the first hypothesis".

36
C_Example 2 Comparison of Traditional
Frequentistic and Bayesian Approach
  • Lavine (1999) reminds us that P(r | θ, n), as a function of r (= 3) and θ = .05, .06, .07, .08, describes only how well each hypothesis explains the data; no value of r other than 3 is relevant.
  • For example, P(4 | .05, 12) is irrelevant, as it does not describe how well any hypothesis explains the data.
  • This likelihood principle, that is, to base statistical inference only on the observed data and not on data that might have been observed, is an essential feature of the Bayesian approach.

37
C_Example 2 Comparison of Traditional
Frequentistic and Bayesian Approach
  • The Fisherian, so-called classical approach to testing the null hypothesis (H0: θ = .05) against the alternative hypothesis (H1: θ > .05) is to calculate the p-value that defines the probability under H0 of observing an outcome at least as extreme as the outcome actually observed.

38
C_Example 2 Comparison of Traditional
Frequentistic and Bayesian Approach
  • After calculations, the p-value becomes ≈ 0.02 and would suggest rejecting H0, if the rejection level of significance is set at 5 per cent.
  • Another problem with the p-value is that it violates the likelihood principle by using P(r | θ, n) for values of r other than the observed value r = 3 (Lavine, 1999):
  • The summands P(4 | .05, 12), P(5 | .05, 12), ..., P(12 | .05, 12) do not describe how well any hypothesis explains the observed data.

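The tail sum behind this p-value can be checked directly (a sketch; `binom_pmf` is defined here to keep the snippet self-contained):

```python
from math import comb

def binom_pmf(r, n, theta):
    """P(X = r) for X ~ Bin(n, theta)."""
    return comb(n, r) * theta**r * (1 - theta)**(n - r)

# p-value under H0 (theta = .05): probability of observing 3 or more
# lowest grades among 12 students, i.e. P(3|.05,12) + ... + P(12|.05,12).
p_value = sum(binom_pmf(r, 12, 0.05) for r in range(3, 13))
print(round(p_value, 3))  # 0.02
```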
39
C_Example 2 Comparison of Traditional
Frequentistic and Bayesian Approach
  • A Bayesian approach will continue from the same
    point as the classical approach, namely
    probabilities given by the binomial
    distributions, but also make use of other
    relevant sources of a priori information.
  • In this domain, it is plausible to think that the
    computerized test would make the number of total
    failures more probable than in the previous times
    when the evaluation was based solely on the
    essays.
  • On the other hand, the computer test has only 40 per cent weight in the equation that defines the final stat course grade: .3(Essay_1) + .3(Essay_2) + .4(Computer test) = Final grade.

40
C_Example 2 Comparison of Traditional
Frequentistic and Bayesian Approach
  • Another aspect is to consider the nature of the
    aforementioned tasks, as the essays are distance
    work assignments while the computer test is to be
    performed under observation.
  • Perhaps the course grades of my earlier stat courses have a narrower dispersion due to violation of the independent observation assumption?
  • For example, some students may have copy-pasted text from other sources or collaborated without permission.
  • As we see, there are many sources of a priori information that I judge to be inconclusive and, thus, I define the null hypothesis to be as likely true as false.

41
C_Example 2 Comparison of Traditional
Frequentistic and Bayesian Approach
  • This a priori judgment is expressed mathematically as P(H0) = 1/2 = P(H1) + P(H2) + P(H3). If I further assume that the alternative hypotheses H1, H2 and H3 share the same likelihood, then P(H1) = P(H2) = P(H3) = 1/6.
  • These prior distributions summarize the knowledge
    about ? prior to incorporating the information
    from my course grades.

42
C_Example 2 Comparison of Traditional
Frequentistic and Bayesian Approach
  • An application of Bayes' theorem yields

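A short script reproduces the posterior probabilities discussed on the following slides under the priors defined above (P(H0) = 1/2, the alternatives 1/6 each); this is a sketch, not the authors' original computation:

```python
from math import comb

def binom_pmf(r, n, theta):
    """P(X = r) for X ~ Bin(n, theta)."""
    return comb(n, r) * theta**r * (1 - theta)**(n - r)

# Hypotheses and their prior probabilities: P(H0) = 1/2, the rest 1/6 each.
thetas = {"H0": 0.05, "H1": 0.06, "H2": 0.07, "H3": 0.08}
priors = {"H0": 1/2, "H1": 1/6, "H2": 1/6, "H3": 1/6}

# Likelihood of observing r = 3 lowest grades among n = 12 students.
unnorm = {h: binom_pmf(3, 12, t) * priors[h] for h, t in thetas.items()}
evidence = sum(unnorm.values())
posterior = {h: round(p / evidence, 2) for h, p in unnorm.items()}
print(posterior)  # {'H0': 0.3, 'H1': 0.16, 'H2': 0.23, 'H3': 0.31}
```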
43
C_Example 2 Comparison of Traditional
Frequentistic and Bayesian Approach
  • Similar calculations for the alternative hypotheses yield P(H1 | r = 3) ≈ .16; P(H2 | r = 3) ≈ .23; P(H3 | r = 3) ≈ .31.
  • These posterior distributions summarize the knowledge about θ after incorporating the grade information.
  • The four hypotheses seem to be about equally likely (.30 vs. .16, .23, .31).
  • The odds are about 2 to 1 (.30 vs. .70) that the latest stat course had a higher rate of lowest grades than 0.05.

44
C_Example 2 Comparison of Traditional
Frequentistic and Bayesian Approach
  • The difference between classical and Bayesian statistics would be only philosophical (probability vs. inverse probability) if they always led to similar conclusions.
  • However, in this case the p-value would suggest rejection of H0 (p ≈ .02), but the Bayesian analysis indicates not very strong evidence against θ = .05, only about 2 to 1.

45
C_Example 2 Comparison of Traditional
Frequentistic and Bayesian Approach
  • What if the number of the lowest grades is two?
  • The classical approach would no longer suggest H0 rejection (p ≈ .12).
  • The Bayesian result would stay much the same (.39 vs. .17, .20, .24), saying that there is not much evidence against H0.

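The same sketch, rerun with r = 2, reproduces both numbers quoted above:

```python
from math import comb

def binom_pmf(r, n, theta):
    """P(X = r) for X ~ Bin(n, theta)."""
    return comb(n, r) * theta**r * (1 - theta)**(n - r)

# Classical p-value for r = 2: probability of 2 or more lowest grades under H0.
p_value = sum(binom_pmf(r, 12, 0.05) for r in range(2, 13))
print(round(p_value, 2))  # 0.12

# Bayesian posteriors for r = 2, with priors P(H0) = 1/2 and 1/6 for the rest.
thetas = {"H0": 0.05, "H1": 0.06, "H2": 0.07, "H3": 0.08}
priors = {"H0": 1/2, "H1": 1/6, "H2": 1/6, "H3": 1/6}
unnorm = {h: binom_pmf(2, 12, t) * priors[h] for h, t in thetas.items()}
evidence = sum(unnorm.values())
post = {h: round(p / evidence, 2) for h, p in unnorm.items()}
print(post)  # {'H0': 0.39, 'H1': 0.17, 'H2': 0.2, 'H3': 0.24}
```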
46
Outline
  • Research Overview
  • Introduction to Bayesian Modeling
  • Investigating Non-linearities with Bayesian
    Networks
  • Bayesian Classification Modeling
  • Bayesian Dependency Modeling
  • Bayesian Unsupervised Model-based Visualization

47
Investigating the Number of Non-linear and
Multi-modal Relationships Between Observed
Variables Measuring A Growth-oriented Atmosphere
A_Example 1
  • Petri Nokelainen
  • Pekka Ruohotie
  • Research Centre for Vocational Education (RCVE)
  • University of Tampere, Finland
  • Tomi Silander
  • Complex Systems Computation Group (CoSCo)
  • Helsinki University of Technology, Finland
  • Henry Tirri
  • Nokia Research Center, Finland

See printed version!
(Nokelainen, Silander, Ruohotie & Tirri, 2007, 2003)
48
Introduction
A_Example 1
  • From the social science researcher's point of view, the requirements of traditional frequentistic statistical analysis are very challenging.
  • For example, the assumption of normality of both the phenomenon under investigation and the data is a prerequisite for traditional parametric frequentistic calculations.

49
Introduction
A_Example 1
  • In situations where a latent construct cannot be
    appropriately represented as a continuous
    variable, or where ordinal or discrete indicators
    do not reflect underlying continuous variables,
    or where the latent variables cannot be assumed
    to be normally distributed, traditional Gaussian
    modeling is clearly not appropriate.
  • In addition, normal distribution analysis sets minimum requirements for the number of observations and requires the measurement level of the variables to be continuous.

50
Introduction
A_Example 1
  • The Bayesian modeling approach is a good alternative to traditional frequentistic statistics, as it is capable of handling small discrete, non-normal samples as well as large-scale continuous data sets.
  • The purpose of this paper is to investigate the
    number of non-linear and multi-modal
    relationships between variables in various
    real-world empirical Growth-oriented Atmosphere
    data in order to find how much they weaken the
    robustness of linear statistical methods.

51
Research Questions
A_Example 1
  • What kinds of non-linearities, and how many, are captured by discrete Bayesian networks?
  • Is there a difference between the results of linear bivariate correlations and Bayesian networks?
  • Does an empirical sample containing pure linear dependencies have better overall fit indices in CFA than a sample containing less linear dependencies?
  • Does an empirical sample containing pure linear dependencies have higher CFA parameter estimates than a sample containing less linear dependencies?
  • Are discrete Bayesian networks a viable way to pre-model data before CFA?

52
Types of Non-linearities Studied
A_Example 1
  • We study two different kinds of "non-linearities":
  • non-linear relationships between continuous variables, and
  • multi-modal relationships between continuous variables.
  • Further, we only study simple non-linear relationships between two variables:
  • The dependency between variables X and Y is considered non-linear if the mean of the conditional distribution of Y is not a monotone (i.e., increasing or decreasing) function of X.
  • Similarly, the dependency between variables X and Y is considered multi-modal if the mode of the conditional distribution of Y is not a monotone function of X.

53
Bayesian Dependency Models
A_Example 1
  • This study resembles to some extent the work by Hofmann and Tresp, in which they use the method of Parzen windows to allow non-linear dependencies between continuous variables.
  • The emphasis in their work was to demonstrate the possibility of building Bayesian networks that can capture non-linear relationships.
  • By using discretized variables this possibility comes trivially, but our objective is to find out to what extent this possibility is used, i.e., how many and what kinds of non-linearities are captured by discrete Bayesian networks.

54
Bayesian Dependency Models
A_Example 1
  • Given the identically and independently distributed multivariate data set D over variables V and the prior probability distribution ξ over Bayesian networks, Bayesian probability theory allows us to calculate the probability P(G | D, ξ) of any Bayesian network G.
  • Different networks can then be compared by their probability.
  • Finding the most probable Bayesian network for any given data is known to be NP-hard, which practically ruins the hopes for automatic discovery of the most probable network.
  • However, stochastic search methods have proven successful in finding high-probability networks. Once the network G has been constructed using data D, we can use it to calculate predictive joint distributions P(V | G, D).

55
Bayesian Dependency Models
A_Example 1
  • The Bayesian network structure can be used to effectively calculate conditional marginals of the predictive joint distribution for single variables, i.e., P(Vi | A, G, D), where A is any subset of the variables V.
  • In this paper we only study the marginals where A is a singleton {Vj} and there is either an arrow from Vi to Vj or an arrow from Vj to Vi (we say that Vi and Vj are adjacent in G).

56
Linear and Non-linear Dependencies
A_Example 1
  • Frequentistic parametric statistical techniques
    are designed for normally distributed (both
    theoretically and empirically) indicators that
    have linear dependencies.
  • Univariate normality
  • Multivariate normality
  • Bivariate linearity

57
Linear and Non-linear Dependencies
A_Example 1
rP = -1.00
58
Linear and Non-linear Dependencies
A_Example 1
  • Sometimes the univariate/multivariate normality assumption holds, but bivariate linearity is violated.

59
Linear and Non-linear Dependencies
A_Example 1
rP = -.39
60
Linear and Non-linear Dependencies
A_Example 1
  • In some cases, univariate normality is violated while the dependency remains linear.

61
Linear and Non-linear Dependencies
A_Example 1
rP = .77
62
Linear and Non-linear Dependencies
A_Example 1
  • In some cases, univariate normality is violated and the dependency is non-linear.

63
Linear and Non-linear Dependencies
A_Example 1
rP = .59
64
Linear and Non-linear Dependencies
A_Example 1
rP = .59
65
Measuring Non-linearities
A_Example 1
  • To measure non-linear dependencies captured by
    Bayesian networks, we tested every variable in
    each network by conditioning it one by one with
    its immediate neighbors in the network.
  • We then observed whether the modes and means of
    the conditional distributions were "linear" and
    whether the conditional distributions were
    "unimodal".
  • Linearity of modes and means was tested by recording whether the means and modes were increasing or decreasing functions of the conditioning variable.
  • Even clear departures from line-like behavior were accepted as linear as long as the direction of correlation (positive, negative) did not change.

66
Measuring Non-linearities
A_Example 1
  • In these experiments, "linear" means a relationship that can be more or less adequately modeled by a line describing how the central tendency of the dependent variable varies as a function of the independent variable.
  • In measuring the unimodality of conditional distributions, we judged the dependency to be unimodal if (and only if) none of the conditional distributions P(Y|X) were clearly multimodal.

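The tests described above can be sketched in code. The conditional distributions below are hypothetical, and the helper names (`is_monotone`, `is_unimodal`) are introduced here only for illustration; they mirror the verbal criteria on the slides:

```python
# A sketch of the linearity/unimodality test described above.
# cond[x] is a hypothetical conditional distribution P(Y | X = x)
# over five Likert categories, for each level x of the conditioning variable.

def is_monotone(seq):
    """True if the sequence is non-decreasing or non-increasing."""
    return (all(a <= b for a, b in zip(seq, seq[1:])) or
            all(a >= b for a, b in zip(seq, seq[1:])))

def is_unimodal(dist):
    """True if the distribution rises to a single peak and then falls."""
    peak = dist.index(max(dist))
    return (all(a <= b for a, b in zip(dist[:peak], dist[1:peak + 1])) and
            all(a >= b for a, b in zip(dist[peak:], dist[peak + 1:])))

def mean(dist):
    """Mean of Y on the 1..5 scale under distribution dist."""
    return sum((i + 1) * p for i, p in enumerate(dist))

# Hypothetical conditional distributions P(Y | X = x), x = 1..3.
cond = [
    [0.5, 0.3, 0.1, 0.07, 0.03],
    [0.2, 0.4, 0.2, 0.1, 0.1],
    [0.05, 0.1, 0.2, 0.3, 0.35],
]

means = [mean(d) for d in cond]
modes = [d.index(max(d)) + 1 for d in cond]
print("linear mean:", is_monotone(means))              # True
print("linear mode:", is_monotone(modes))              # True
print("unimodal:", all(is_unimodal(d) for d in cond))  # True
```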
67
Results
A_Example 1
  • What kind of non-linearities and how many are
    captured by discrete Bayesian networks?
  • Investigation of two independent empirical datasets (n = 2430, n = 762) showed that only 39 per cent of all dependencies between variables were purely linear (linear mode, linear mean, unimodal).
  • Nine per cent of dependencies were purely
    non-linear (non-linear mode, non-linear mean,
    multimodal).
  • Multimodality was the most common reason for
    violation of linearity in both datasets.

68
Results
A_Example 1
  • We continued the investigations with two distinct samples of the latter data, namely D21 (n = 447) and D23 (n = 208).
  • These two datasets were selected for the following reasons:
  • First, the sample sizes are closer to each other when compared to the D22 data (n = 71).
  • Second, the two samples were collected with the same self-rated five-point Likert-scale questionnaire.
  • Third, the D21 sample represents in this study "linear" empirical data with 23.9 per cent pure linear and 15.0 per cent pure non-linear dependencies, and the D23 sample represents "non-linear" data with only 16.2 per cent pure linear dependencies and 18.3 per cent pure non-linear dependencies.

69
Results
A_Example 1
  • Our first goal was to compare subject domain interpretations of linear correlational analysis and non-linear Bayesian dependency models in order to investigate whether the models differ in terms of interpretation according to the Growth-oriented Atmosphere model.
  • Is there a difference between the results of linear bivariate correlations and Bayesian networks?
  • The results showed that in general the Bayesian network models were congruent with the correlation matrices, as both methods found the same variables independent of all the other variables.
  • However, non-linear modeling found a greater number of strong dependencies between growth-oriented atmosphere factors in both samples.

70
Results
A_Example 1
  • Our second goal was to investigate the following
    four aspects of the growth-oriented atmosphere
    theory
  • support and rewards from the management,
  • the incentive value of the job,
  • operational capacity of the team, and
  • work-related stress.
  • Does an empirical sample containing pure linear dependencies have better overall fit indices in CFA than a sample containing less linear dependencies?

71
Results
A_Example 1
  • First, when comparing CFA and Bayesian modeling, we learned that the latter is unable to find support for the second aspect under investigation, namely the relationship between incentive value of the job, know-how developing and valuation of the job.
  • Second, we found no major differences in results between the linear and non-linear samples.
  • However, the linear data has higher parameter estimates in all four aspects under investigation.

72
Results
A_Example 1
  • Next we investigated whether theoretically justifiable dependencies between factors found by the Bayesian models are also present in the CFA models.
  • Does an empirical sample containing pure linear dependencies have higher CFA parameter estimates than a sample containing less linear dependencies?
  • We conducted confirmatory factor analysis with the growth-oriented atmosphere model and examined the differences between the linear (D21) and non-linear (D23) factor covariance matrices.
  • The results showed that the CFA model performed better with the linear sample.

73
Results
A_Example 1
  • Is there a difference between substantive interpretations of the results of CFA and BDM with linear and non-linear samples?
  • Neither Bayesian dependency model supported the second theoretical assumption about the relationship between Incentive value of the job (INV), Know-how developing (DEV) and Valuation of the job (VAL).

74
Results
A_Example 1
75
Results
A_Example 1
  • Is there a difference between substantive interpretations of the results of CFA and BDM with linear and non-linear samples?
  • The second observation is that the fourth
    theoretical assumption about the negative
    influence of Psychic stress (PSY) on all the
    other factors is only partially supported in both
    Bayesian models.

76
Results
A_Example 1
77
Results
A_Example 1
  • Is there a difference between substantive interpretations of the results of CFA and BDM with linear and non-linear samples?
  • Finally, the linear sample (D21) has in most
    cases higher CFA parameter estimates than the
    non-linear sample.

78
Conclusions
A_Example 1
  • This study investigated the number of non-linear
    and multi-modal relationships between variables
    in various real-world empirical Growth-oriented
    Atmosphere data.
  • Investigation of two independent empirical datasets (n = 2430 and n = 762) showed that only 39 per cent of all dependencies between variables were purely linear (linear mode, linear mean, unimodal).
  • Nine per cent of dependencies were purely
    non-linear (non-linear mode, non-linear mean,
    multimodal).
  • Multimodality was the most common reason for
    violation of linearity in both datasets.

79
Conclusions
A_Example 1
  • Two subgroups of the latter data were identified as linear (D21, n = 447) and non-linear (D23, n = 208).
  • Both correlational analysis and Bayesian dependency modeling were applied to these data in order to investigate relationships between the fourteen factors of the growth-oriented atmosphere model.
  • Our conclusion, based on this preliminary analysis of two relatively small empirical samples, is that the descriptive power of traditional linear models (e.g., correlational analysis) is sufficient with non-linear data (pure linear dependencies vary between 16.2 and 23.9 per cent).

80
Outline
  • Research Overview
  • Introduction to Bayesian Modeling
  • Investigating Non-linearities with Bayesian
    Networks
  • Bayesian Classification Modeling
  • Bayesian Dependency Modeling
  • Bayesian Unsupervised Model-based Visualization

81
Bayesian Classification Modeling
  • Which variables are the best predictors for
    different group memberships (e.g., A or C group,
    gender, productivity, level of giftedness).
  • In the classification process, the automatic
    search is looking for the best set of variables
    to predict the class variable for each data item.

82
Bayesian Classification Modeling
  • The search procedure resembles the traditional
    linear discriminant analysis (LDA, Huberty, 1994,
    118-126), but the implementation is totally
    different.
  • For example, the variable selection problem that is addressed with forward, backward or stepwise selection procedures in LDA is replaced with a genetic algorithm approach (e.g., Hilario, Kalousis, Prados & Binz, 2004; Hsu, 2004) in Bayesian classification modeling.

83
Bayesian Classification Modeling
  • The genetic algorithm approach means that
    variable selection is not limited to one (or two
    or three) specific approaches; instead, many
    approaches and their combinations are exploited.
  • One possible approach is to begin with the
    presumption that the models (i.e., possible
    predictor variable combinations) that resemble
    each other a lot (i.e., have almost same
    variables and discretizations) are likely to be
    almost equally good.
  • This leads to a search strategy in which models
    that resemble the current best model are selected
    for comparison, instead of picking models
    randomly.

84
Bayesian Classification Modeling
  • Another approach is to abandon the habit of
    always rejecting the weakest model and instead
    collect a set of relatively good models.
  • The next step is to combine the best parts of
    these models so that the resulting combined model
    is better than any of the original models.
  • B-Course is capable of mobilizing many more
    viable approaches, for example rejecting the
    better model (algorithms like hill climbing and
    simulated annealing) or trying to avoid picking
    a similar model twice (tabu search).
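The contrast drawn on these slides can be sketched as follows. This is a generic illustration, not B-Course code: `toy_score` is an invented stand-in for the Bayesian model score discussed later in the lecture, and the search flips one variable in or out of the candidate model per step. Plain hill climbing would accept only improvements; the annealing variant sometimes accepts a weaker model, which helps it escape local optima.

```python
import math
import random

def anneal(variables, score, steps=300, temp=1.0, cooling=0.97, seed=0):
    """Simulated-annealing search over variable subsets (a sketch, not B-Course)."""
    rng = random.Random(seed)
    current = frozenset()                  # start from the empty model
    best = current
    for _ in range(steps):
        v = rng.choice(variables)          # flip one variable in or out
        candidate = current ^ {v}
        delta = score(candidate) - score(current)
        # hill climbing would require delta > 0; annealing sometimes
        # accepts a weaker model with probability exp(delta / temp)
        if delta > 0 or rng.random() < math.exp(delta / temp):
            current = candidate
        if score(current) > score(best):
            best = current                 # keep the best model seen so far
        temp *= cooling                    # cool down: fewer bad moves later
    return best

# hypothetical scoring function: 'a' and 'b' predict well together, 'c' is noise
def toy_score(model):
    return 2 * ('a' in model and 'b' in model) + ('a' in model) - ('c' in model)

print(sorted(anneal(list("abc"), toy_score)))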

85
C_Example 3 Motivational Predictors for Study
Group Membership
  • The researcher picked from a student population
    (N = 240) a subsample (one study group, n = 23)
    for closer investigation.
  • His goal was to study how students' self-reported
    learning motivation was related to the academic
    success of group work.
  • The subsample consisted of five groups that the
    students had formed by themselves in the early
    stages of their studies.

86
C_Example 3 Motivational Predictors for Study
Group Membership
  • All the participants (n = 23) filled in the
    Abilities for Professional Learning Questionnaire
    (Ruohotie, 2002; Nokelainen & Ruohotie, 2002),
    which measures their motivational level.
  • See research article 3 for item descriptions.
  • The researcher interviewed the participants to
    profile the groups.

87
C_Example 3 Motivational Predictors for Study
Group Membership
  • Classification analysis was conducted using class
    membership as the class variable.
  • The aim of the analysis was to learn which of the
    APLQ items (i.e., learning motivation dimensions)
    would best predict differences between the
    groups.
  • The naïve Bayes network produced in the analysis
    is used to examine special features of the
    groups.
  • In addition, it allows groupwise comparison.

88
C_Example 3 Motivational Predictors for Study
Group Membership
Sample size: n = 23
Classification accuracy: 60.87%
Common components: V2, V3, V4, V6, V7, V8, V10,
V11, V12, V13, V14, V16, V17, V20, V21, V22, V23,
V25, V26, V28
89
C_Example 3 Motivational Predictors for Study
Group Membership
90
C_Example 4 Mobile Learning Components
Predicting the Use of Three Types of Computers
  • The study investigated how components of mobile
    learning predict the use of different computer
    devices (Syvänen, Nokelainen, Ahonen & Turunen,
    2003).
  • The sample (n = 87) consisted of 5th and 6th
    grade Finnish elementary school students.

91
C_Example 4 Mobile Learning Components
Predicting the Use of Three Types of Computers
  • The classification variable was the device used,
    with three values:
  • 1 = Handheld computer, 2 = Portable computer,
    3 = Desktop computer.
  • Fourteen questions were asked of the students to
    measure their mobile learning experiences.

92
C_Example 4 Mobile Learning Components
Predicting the Use of Three Types of Computers
  • Variable Description
  • DEEP Deep approach
  • HELPSEE Help-seeking
  • MANAGEM Learning management
  • CREATIV Creativity in problem solving
  • EFFECTI Perceived effectiveness
  • SELFEFF Self-efficacy
  • SEARCH Knowledge seeking
  • SHARE Knowledge sharing
  • DUALISM Conception of knowledge
  • SURFACE Surface approach
  • CONFIDENCE Computer confidence
  • PEERLEARN Peer learning
  • EASINESS Perceived easiness of use
  • CONSTRUC Knowledge construction

93
C_Example 4 Mobile Learning Components
Predicting the Use of Three Types of Computers
Sample size: n = 87
Classification accuracy: 62.32%
Common components: DUALISM, SURFACE, CONFIDENCE,
PEERLEARN, EASINESS, CONSTRUC
94
C_Example 4 Mobile Learning Components
Predicting the Use of Three Types of Computers
95
Investigating the Influence of Attribution Styles
on the Development of Mathematical Talent
A_Example 2
  • Petri Nokelainen
  • Research Centre for Vocational Education
  • University of Tampere, Finland
  • Kirsi Tirri
  • Department of Practical Theology
  • University of Helsinki, Finland
  • Hanna-Leena Merenti-Välimäki
  • Espoo-Vantaa Institute of Technology, Finland

See printed version!
(Nokelainen, Tirri & Merenti-Välimäki, 2007.)
96
Outline
  • Research Overview
  • Introduction to Bayesian Modeling
  • Investigating Non-linearities with Bayesian
    Networks
  • Bayesian Classification Modeling
  • Bayesian Dependency Modeling
  • Bayesian Unsupervised Model-based Visualization

97
Bayesian Dependency Modeling
  • Bayesian dependency modeling (BDM) is applied to
    examine dependencies between variables through
    both their visual representation and the
    probability ratio of each dependency.
  • The graphical visualization of a Bayesian network
    contains two components:
  • 1) Observed variables, visualized as ellipses.
  • 2) Dependencies, visualized as lines between
    nodes.

98
C_Example 5 Calculation of Bayesian Score
  • Next, I will present how the Bayesian score (BS),
    that is, the probability of the model P(M|D), is
    first calculated and then compared for the two
    models presented in the figure:

Figure 9. An Example of Two Competing Bayesian
Network Structures
(Nokelainen, 2008, p. 121)
99
C_Example 5 Calculation of Bayesian Score
  • Let us assume that we have the following data:

    x1  x2
     1   1
     1   1
     2   2
     1   2
     1   1
  • Model 1 (M1) represents the two variables, x1 and
    x2, without a statistical dependency, and Model 2
    (M2) represents the two variables with a
    dependency (i.e., with a connecting arc).
  • The binomial data might be the result of an
    experiment where the five participants solved a
    job-related task before (x1) and after (x2) a
    vocational training period.

100
C_Example 5 Calculation of Bayesian Score
  • In order to calculate P(M1,2|D), we need to solve
    P(D|M1,2) for the two models M1 and M2.
  • The probability of the data given the model is
    solved by using the following marginal likelihood
    equation (Congdon, 2001, p. 473; Myllymäki,
    Silander, Tirri & Uronen, 2001; Myllymäki &
    Tirri, 1998, p. 63):
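The equation itself appears only as an image in the original slides. The Bayesian Dirichlet marginal likelihood referred to here, in the standard form given by Heckerman, Geiger and Chickering (1995), reads:

```latex
P(D \mid M) = \prod_{i=1}^{n} \prod_{j=1}^{q_i}
  \frac{\Gamma\!\left(N'_{ij}\right)}{\Gamma\!\left(N'_{ij} + N_{ij}\right)}
  \prod_{k=1}^{r_i}
  \frac{\Gamma\!\left(N'_{ijk} + N_{ijk}\right)}{\Gamma\!\left(N'_{ijk}\right)}
```

where the primed terms N'_ij and N'_ijk are the Dirichlet prior counts derived from the equivalent sample size; the remaining symbols are explained on the next slide.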

101
C_Example 5 Calculation of Bayesian Score
  • In Equation 4, the following symbols are used:
  • n is the number of variables (i indexes variables
    from 1 to n);
  • ri is the number of values of the ith variable
    (k indexes these values from 1 to ri);
  • qi is the number of possible configurations of
    the parents of the ith variable (j indexes these
    configurations from 1 to qi);
  • Nij is the number of rows in the data that have
    the jth configuration for the parents of the ith
    variable;
  • Nijk is the number of rows in the data that have
    the kth value for the ith variable and the jth
    configuration for the parents of the ith
    variable;
  • N' is the equivalent sample size, set to the
    average number of values divided by two.
  • The marginal likelihood equation produces a
    Bayesian Dirichlet score that allows model
    comparison (Heckerman et al., 1995; Tirri, 1997;
    Neapolitan & Morris, 2004).

102
C_Example 5 Calculation of Bayesian Score
  • First, I will calculate P(D|M1) given the values
    of variable x1:

[Calculation shown as an image in the original slides; for the x1 term it evaluates to approximately 0.027.]
103
C_Example 5 Calculation of Bayesian Score
  • Second, the values for x2 are calculated:
104
C_Example 5 Calculation of Bayesian Score
  • The BS, the probability of the first model
    P(M1|D), is 0.027 × 0.012 ≈ 0.000324.

105
C_Example 5 Calculation of Bayesian Score
  • Third, P(D|M2) is calculated given the values of
    variable x1:

106
C_Example 5 Calculation of Bayesian Score
  • Fourth, the values for the first parent
    configuration (x1 = 1) are calculated:

107
C_Example 5 Calculation of Bayesian Score
  • Fifth, the values for the second parent
    configuration (x1 = 2) are calculated:

108
C_Example 5 Calculation of Bayesian Score
  • The BS, the probability of the second model
    P(M2|D), is 0.027 × 0.027 × 0.500 ≈ 0.000365.

109
C_Example 5 Calculation of Bayesian Score
  • Bayes' theorem enables the calculation of the
    ratio of the two models, M1 and M2.
  • As both models share the same a priori
    probability, P(M1) = P(M2), the prior
    probabilities cancel out.
  • The probability of the data, P(D), also cancels
    out in the following equation, as it appears in
    the same position in both formulas:

110
C_Example 5 Calculation of Bayesian Score
  • The result of the model comparison shows that,
    since the ratio P(M1|D) / P(M2|D) is less than 1,
    M2 is more probable than M1.
  • This result becomes explicit when we investigate
    the sample data more closely.
  • Even a sample this small (n = 5) shows a clear
    tendency between the values of x1 and x2 (four
    out of five value pairs are identical).

x1  x2
 1   1
 1   1
 2   2
 1   2
 1   1
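The worked example above can be reproduced programmatically. Below is a short sketch, assuming the Bayesian Dirichlet prior described earlier (equivalent sample size 1, i.e., the average number of values divided by two), that recomputes both marginal likelihoods and their ratio; the function name `family_score` is my own, not from B-Course.

```python
import math
from itertools import product

# Data from the slides: five cases before (x1) and after (x2) training
x1 = [1, 1, 2, 1, 1]
x2 = [1, 1, 2, 2, 1]

def family_score(child, parents, r_child, r_parents, ess=1.0):
    """Bayesian Dirichlet marginal likelihood term for one variable.

    Prior counts are ess / (r_child * q) per cell (N'_ijk) and ess / q per
    parent configuration (N'_ij), matching the slides' equivalent sample size.
    """
    q = 1
    for r in r_parents:
        q *= r
    a_ijk = ess / (r_child * q)
    a_ij = ess / q
    configs = list(product(*[range(1, r + 1) for r in r_parents]))
    score = 1.0
    for cfg in configs:  # one empty configuration when there are no parents
        rows = [i for i in range(len(child))
                if all(p[i] == v for p, v in zip(parents, cfg))]
        score *= math.gamma(a_ij) / math.gamma(a_ij + len(rows))
        for k in range(1, r_child + 1):
            n_ijk = sum(1 for i in rows if child[i] == k)
            score *= math.gamma(a_ijk + n_ijk) / math.gamma(a_ijk)
    return score

# M1: x1 and x2 independent; M2: an arc from x1 to x2
p_d_m1 = family_score(x1, [], 2, []) * family_score(x2, [], 2, [])
p_d_m2 = family_score(x1, [], 2, []) * family_score(x2, [x1], 2, [2])

print(p_d_m1)           # ~0.00032, the slides' 0.027 * 0.012
print(p_d_m2)           # ~0.00037, the slides' 0.027 * 0.027 * 0.5
print(p_d_m1 / p_d_m2)  # < 1, so M2 is the more probable model
```

The small discrepancies against the slide values (0.000324 and 0.000365) come only from the slides rounding each factor to three decimals before multiplying.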
111
C_Example 6 Modeling of Prerequisites for
Organizational Learning
  • Staff of five Finnish police organizations
    (n = 281) filled out the Growth-oriented
    Atmosphere Questionnaire (Luoma, Nokelainen &
    Ruohotie, 2002).
  • BDM was applied to study how the theoretical
    model is represented in the Bayesian network.

112
C_Example 6 Modeling of Prerequisites for
Organizational Learning
113
C_Example 6 Modeling of Prerequisites for
Organizational Learning
114
Investigating Subordinates' Evaluations on their
Superiors' Emotional Leadership
A_Example 3
  • Petri Nokelainen
  • Pekka Ruohotie
  • Research Centre for Vocational Education
  • University of Tampere, Finland

See printed version!
(Nokelainen & Ruohotie, 2005.)
115
Conceptual Modeling of Self-rated
Intelligence-profile
A_Example 4
  • Kirsi Tirri and Erkki Komulainen
  • University of Helsinki, Finland
  • Petri Nokelainen and Henry Tirri
  • Helsinki University of Technology, Finland

See printed version!
(Tirri, K., Komulainen, Nokelainen & Tirri, H.,
2002.)
116
Outline
  • Research Overview
  • Introduction to Bayesian Modeling
  • Investigating Non-linearities with Bayesian
    Networks
  • Bayesian Classification Modeling
  • Bayesian Dependency Modeling
  • Bayesian Unsupervised Model-based Visualization

117
Bayesian Unsupervised Model-based Visualization
  • The dispersion of single data vectors is examined
    in three-dimensional space in order to find how
    the factors are interrelated at the individual
    level.
  • The data is mapped into a different set of
    dimensions according to the optimized solution,
    from which the Bayesian algorithm produced one
    optimal model.
  • The three-dimensional model is plotted as a
    series of two-dimensional figures, each
    presenting one dimension at a time.
  • BayMiner: http://www.bayminer.com

118
(No Transcript)
119
Investigating Growth Prerequisites in a Finnish
Polytechnic Institute of Higher Education
A_Example 5
  • Petri Nokelainen
  • Pekka Ruohotie
  • Research Centre for Vocational Education
  • University of Tampere, Finland

See printed version!
(Nokelainen & Ruohotie, in press, to appear in
the Journal of Workplace Learning.)
120
Links
  • Research Centre for Vocational Education
    <URL: http://www.uta.fi/aktkk>
  • Complex Systems Computation Group
    <URL: http://cosco.hiit.fi>
  • EDUTECH <URL: http://cosco.hiit.fi/edutech>
  • B-COURSE <URL: http://b-course.hiit.fi>
  • BayMiner <URL: http://www.bayminer.com>

121
References
  • Abelson, R. P. (1995). Statistics as Principled
    Argument. Hillsdale, NJ: Lawrence Erlbaum
    Associates.
  • Anderson, J. (1995). Cognitive Psychology and Its
    Implications. New York: Freeman.
  • Bayes, T. (1763). An essay towards solving a
    problem in the doctrine of chances. Philosophical
    Transactions of the Royal Society, 53, 370-418.
  • Bernardo, J., & Smith, A. (2000). Bayesian
    theory. New York: Wiley.
  • Brannen, J. (2004). Working qualitatively and
    quantitatively. In C. Seale, G. Gobo, J. Gubrium,
    & D. Silverman (Eds.), Qualitative Research
    Practice (pp. 312-326). London: Sage.
  • Fisher, R. (1935). The design of experiments.
    Edinburgh: Oliver & Boyd.

122
References
  • Gigerenzer, G. (2000). Adaptive thinking. New
    York: Oxford University Press.
  • Gill, J. (2002). Bayesian methods. A Social and
    Behavioral Sciences Approach. Boca Raton: Chapman
    & Hall/CRC.
  • Gigerenzer, G., Krauss, S., & Vitouch, O. (2004).
    The null ritual: What you always wanted to know
    about significance testing but were afraid to
    ask. In D. Kaplan (Ed.), The SAGE handbook of
    quantitative methodology for the social sciences
    (pp. 391-408). Thousand Oaks: Sage.
  • Gobo, G. (2004). Sampling, representativeness and
    generalizability. In C. Seale, J. F. Gubrium, G.
    Gobo, & D. Silverman (Eds.), Qualitative Research
    Practice (pp. 435-456). London: Sage.
  • Hair, J. F., Anderson, R. E., Tatham, R. L., &
    Black, W. C. (1998). Multivariate Data Analysis.
    Fifth edition. Englewood Cliffs, NJ: Prentice
    Hall.

123
References
  • Heckerman, D., Geiger, D., & Chickering, D.
    (1995). Learning Bayesian networks: The
    combination of knowledge and statistical data.
    Machine Learning, 20(3), 197-243.
  • Lavine, M. L. (1999). What is Bayesian Statistics
    and Why Everything Else is Wrong. The Journal of
    Undergraduate Mathematics and Its Applications,
    20, 165-174.
  • Lindley, D. V. (1971). Making Decisions. London:
    Wiley.
  • Lindley, D. V. (2001). Harold Jeffreys. In C. C.
    Heyde & E. Seneta (Eds.), Statisticians of the
    Centuries (pp. 402-405). New York: Springer.
  • Luoma, M., Nokelainen, P., & Ruohotie, P. (2003,
    April). Learning Strategies for Police
    Organization - Modeling Organizational Learning
    Prerequisites. Paper presented at the Annual
    Meeting of the American Educational Research
    Association (AERA 2002). New Orleans, USA.

124
References
  • Myllymäki, P., Silander, T., Tirri, H., & Uronen,
    P. (2002). B-Course: A Web-Based Tool for
    Bayesian and Causal Data Analysis. International
    Journal on Artificial Intelligence Tools, 11(3),
    369-387.
  • Myllymäki, P., & Tirri, H. (1998).
    Bayes-verkkojen mahdollisuudet [Possibilities of
    Bayesian Networks]. Teknologiakatsaus 58/98.
    Helsinki: TEKES.

125
References
  • Nokelainen, P. (2008). Modeling of Professional
    Growth and Learning: Bayesian Approach. Tampere:
    Tampere University Press.
  • Nokelainen, P., & Ruohotie, P. (2005).
    Investigating the Construct Validity of the
    Leadership Competence and Characteristics Scale.
    In the Proceedings of the International Research
    on Work and Learning 2005 Conference, Sydney,
    Australia.
  • Nokelainen, P., & Ruohotie, P. (In press).
    Investigating Growth Prerequisites in a Finnish
    Polytechnic for Higher Education. To appear in
    the Journal of Workplace Learning.

126
References
  • Nokelainen, P., Silander, T., Ruohotie, P., &
    Tirri, H. (2003, August). Investigating
    Non-linearities with Bayesian Networks. Paper
    presented at the 111th Annual Convention of the
    American Psychological Association, Division of
    Evaluation, Measurement and Statistics. Toronto,
    Canada.
  • Nokelainen, P., Silander, T., Ruohotie, P., &
    Tirri, H. (2007). Investigating the Number of
    Non-linear and Multi-modal Relationships Between
    Observed Variables Measuring a Growth-oriented
    Atmosphere. Quality & Quantity, 41(6), 869-890.

127
References
  • Nokelainen, P., Tirri, K., & Merenti-Välimäki,
    H.-L. (2007). Investigating the Influence of
    Attribution Styles on the Development of
    Mathematical Talent. Gifted Child Quarterly,
    51(1), 64-81.
  • Syvänen, A., Nokelainen, P., Ahonen, M., &
    Turunen, H. (2003, August). Approaches to
    Assessing Mobile Learning Components. Paper
    presented at the 10th Biennial Conference of the
    European Association for Research on Learning and
    Instruction. Padova, Italy.
  • Thompson, B. (1994). Guidelines for authors.
    Educational and Psychological Measurement, 54(4),
    837-847.

128
References
  • Tirri, K., Komulainen, E., Nokelainen, P., &
    Tirri, H. (2002). Conceptual Modeling of
    Self-Rated Intelligence-Profile. In Proceedings
    of the 2nd International Self-Concept Research
    Conference. University of Western Sydney: Self
    Research Center.
  • de Vaus, D. A. (2004). Research Design in Social
    Research. Third edition. London: Sage.