Data Mining for an Educational Web-based System


1
Data Mining for an Educational Web-based System
  • Behrouz Minaei
  • Department of Computer Science and Engineering
  • Thesis Proposal
  • January 30th 2004

2
Topics
  • Statement of problem
  • Classification (Predicting student performance)
  • Clustering (Ensembles of multiple clusterings)
  • Additional Proposed work
  • Tentative Schedule

3
Statement of problem
  • Statement of problem
  • LON-CAPA
  • Data Mining
  • Data Preprocessing
  • Contributions
  • G. Albertelli, B. Minaei-Bidgoli, W.F. Punch, G.
    Kortemeyer, and E. Kashy, Concept Feedback in
    Computer-Assisted Assignments, Proceedings of
    the (IEEE/ASEE) Frontiers in Education
    Conference, 2002, Boston
  • M. Hall, J. Parker, B. Minaei-Bidgoli, G.
    Albertelli, G. Kortemeyer, and E. Kashy,
    Gathering and Timely Use of Feedback from
    Individualized On-line Work with an Open-Source
    CMS, submitted to (IEEE/ASEE) FIE 2004, Frontiers
    in Education, Oct. 2004, Savannah
  • Classification (Predicting student performance)
  • Clustering (Ensembles of multiple clusterings)
  • Additional proposed work
  • Tentative Schedule

4
LON-CAPA
  • This research is a part of the latest online
    educational system developed at Michigan State
    University (MSU), the Learning Online Network
    with Computer-Assisted Personalized Approach
    (LON-CAPA).
  • Learning Content Management System
  • 9 high schools, 2 community colleges, and 17
    universities nationwide
  • Assessment System
  • Online assessment with immediate feedback and
    multiple tries
  • Different students get different versions of the
    same problem
  • Different options, graphs, images, numbers, or
    formulas
  • Open-Source and Free (GPL, Runs on Linux)

5
LON-CAPA Data
  • Three kinds of growing data sets
  • Educational resources: web pages, demonstrations,
    simulations, individualized problems, quizzes,
    and examinations.
  • Information about users who create, modify,
    assess, or use these resources.
  • Data about how students use and access the
    educational materials

6
MSU Fall 2003
  • 40 courses used LON-CAPA at MSU
  • Total student enrollment approximately 3,067 (out
    of 13,400 total global student-users)
  • Disciplines included Advertising, Biochemistry,
    Biology, Chemistry, Finance, Geology, Math,
    Physics, Plant Biology, Statistics for Psychology

7
Statement of problem
  • LON-CAPA collects data for every single access to
    the resources in both the activity log and the
    student database
  • Logs are not only huge but also distributed and
    specific to a web-based educational system
    (LON-CAPA)
  • Intelligent automated tools are needed to discover
    relevant, useful, and interesting patterns
  • Apply the discovered rules to produce a more
    intelligent system

8
Knowledge Discovery Process
  • Data integration: removing inconsistencies
  • Data cleansing: correcting errors and handling
    missing values
  • Discretization: transforming continuous values to
    categorical ones
  • Feature selection: choosing the features that are
    more relevant
  • Mining process: rule discovery
  • Post-processing:
  • Large rule set → simplify
  • 1) More comprehensible, 2) More interesting
  • Use a combination of objective and subjective
    approaches
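
The discretization step can be illustrated with a minimal sketch. It assumes equal-width binning, which the slides do not specify; the function name `discretize`, the three labels, and the sample values are hypothetical and are not taken from the LON-CAPA data.

```python
def discretize(values, labels=("low", "medium", "high")):
    """Equal-width binning of continuous values into categorical labels."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so everything falls in the first bin.
        return [labels[0] for _ in values]
    width = (hi - lo) / len(labels)
    result = []
    for v in values:
        # The maximum value would index one past the last bin, so clamp it.
        index = min(int((v - lo) / width), len(labels) - 1)
        result.append(labels[index])
    return result


# Hypothetical "time spent" values in minutes (illustrative only).
time_spent = [2.0, 5.5, 9.0, 14.0, 30.0]
categories = discretize(time_spent)
print(categories)
```

Equal-width bins are sensitive to outliers; equal-frequency bins or domain-specific cut points may be a better fit for skewed features such as time spent.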

9
Data Mining Tasks
  • Classification
  • The goal is to predict the class variable based
    on the feature values of samples
  • Avoid overfitting
  • Clustering (unsupervised learning)
  • Association Analysis
  • Find the binary relationship among the data items
  • Any feature variable can occur both in antecedent
    and in the consequent of a rule.

10
Contributions (1)
  • Our claim is that data mining can help to design
    a better and more intelligent educational web-based
    environment
  • Can help instructors design the course more
    effectively and detect anomalies
  • Can help students use the resources more
    efficiently

11
Contributions (2)

12
Contributions (3)
  • Can find association rules among students'
    educational activities
  • Can be used to identify students who are at
    risk, especially in very large classes
  • Can help instructors predict the approaches that
    students will take for some types of problems

13
Predicting student performance
  • Statement of problem
  • Classification
  • Combination of Classifiers
  • Weighting the features
  • Using a Genetic Algorithm to find the best set of
    weights
  • B. Minaei-Bidgoli, W.F. Punch, Using Genetic
    Algorithms for Data Mining Optimization in an
    Educational Web-based System, GECCO 2003,
    2252-2263, July 2003 Chicago.
  • B. Minaei-Bidgoli, D.A. Kashy, G. Kortemeyer,
    W.F. Punch, Predicting Student Performance: An
    Application of Data Mining Methods with an
    Educational Web-based System, (IEEE/ASEE) FIE
    2003, Frontiers in Education, Nov. 2003, Boulder
  • Clustering (Ensembles of multiple clusterings)
  • Proposed work
  • Tentative Schedule

14
Data Set: PHY183 SS02
  • 227 students
  • 12 Homework sets
  • 184 Problems
  • 80 MB activity log
  • 26 MB useful data
  • 220,000 transactions
  • Extracted Features
  • Total number of correct answers (success rate)
  • Success at the first try
  • Number of attempts to get the answer
  • Time spent until correct
  • Total time spent on the problem
  • Participation in the communication mechanisms
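
A minimal sketch of how some of these features could be aggregated from an activity log. The record layout `(student, problem, time_in_seconds, correct)` and all names are assumptions made for illustration; the real LON-CAPA log format is not shown in the slides.

```python
from collections import defaultdict


def extract_features(log):
    """Aggregate per-student features from (student, problem, time, correct) rows.

    Rows are assumed to be sorted by time.
    """
    # Group submission attempts by (student, problem).
    attempts = defaultdict(list)
    for student, problem, time, correct in log:
        attempts[(student, problem)].append((time, correct))

    features = defaultdict(
        lambda: {"solved": 0, "first_try": 0, "attempts": 0, "time_to_correct": 0}
    )
    for (student, _problem), tries in attempts.items():
        f = features[student]
        f["attempts"] += len(tries)
        for i, (time, correct) in enumerate(tries):
            if correct:
                f["solved"] += 1
                f["first_try"] += 1 if i == 0 else 0
                # Time from the first attempt to the first correct one.
                f["time_to_correct"] += time - tries[0][0]
                break
    return dict(features)


# Hypothetical log rows (illustrative only).
log = [
    ("s1", "p1", 0, False), ("s1", "p1", 60, True),
    ("s1", "p2", 100, True),
    ("s2", "p1", 10, False), ("s2", "p1", 50, False),
]
features = extract_features(log)
print(features["s1"])
```

Here "time spent until correct" is approximated as the gap between a student's first attempt and first correct attempt on a problem; a real log would also need session handling to exclude idle time.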

15
Class Labels (3 possibilities)
  • 2-Classes
  • 3-Classes
  • 9-Classes

16
Classifiers
  • Non-Tree Classifiers (Using MATLAB)
  • Bayesian Classifier
  • 1NN
  • kNN
  • Multi-Layer Perceptron
  • Parzen Window
  • Combination of Multiple Classifiers (CMC)
  • Genetic Algorithm (GA), Optimizer
  • Decision Tree-Based Software
  • C5.0 (RuleQuest << C4.5 << ID3)
  • CART (Salford-systems)
  • QUEST (Univ. of Wisconsin)
  • CRUISE: uses an unbiased variable selection
    technique

17
Fitness/Evaluation Function
  • 5 classifiers
  • Multi-Layer Perceptron: 2 minutes
  • Bayesian Classifier
  • 1NN
  • kNN
  • Parzen Window
  • CMC: 3 seconds
  • Divide data into training and test sets (10-fold
    cross-validation)
  • Fitness function: performance achieved by the
    classifier
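
A fitness function of this kind can be sketched as follows. This is an illustration under stated assumptions, not the presenter's implementation: a single weighted 1-NN classifier stands in for the combination of classifiers, folds are formed by simple interleaving, and the toy data and the names `weighted_1nn_predict` and `fitness` are hypothetical.

```python
import math


def weighted_1nn_predict(train, weights, x):
    """Label of the training sample nearest to x under a weighted Euclidean distance."""
    def dist(a, b):
        return math.sqrt(sum(w * (ai - bi) ** 2 for w, ai, bi in zip(weights, a, b)))

    nearest = min(train, key=lambda sample: dist(sample[0], x))
    return nearest[1]


def fitness(weights, data, folds=10):
    """k-fold cross-validated accuracy of weighted 1-NN; higher is fitter."""
    correct = 0
    for k in range(folds):
        test = data[k::folds]  # every folds-th sample, starting at index k
        train = [s for i, s in enumerate(data) if i % folds != k]
        correct += sum(weighted_1nn_predict(train, weights, x) == y for x, y in test)
    return correct / len(data)


# Toy data: feature 0 separates the classes, feature 1 is misleading.
data = [
    ((0.0, 5.0), "fail"), ((0.1, 1.0), "fail"), ((0.2, 9.0), "fail"),
    ((0.9, 5.1), "pass"), ((1.0, 1.1), "pass"), ((0.8, 9.1), "pass"),
    ((0.05, 3.0), "fail"), ((0.95, 3.1), "pass"),
]
print(fitness((1.0, 0.0), data, folds=2))
print(fitness((0.0, 1.0), data, folds=2))
```

A GA would call `fitness` once per individual, with the individual's chromosome supplying the feature weights, and keep the weight vectors that score highest.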

18
Individual Representation
  • The GA Toolbox supports binary, integer and
    floating-point chromosome representations.
  • Chrom = crtrp(N, FieldDR) creates a random
    real-valued matrix of size N x d, where N is the
    number of individuals (200) and FieldDR is a matrix
    of size 2 x d that contains the boundaries of each
    variable of an individual.
  • FieldDR = [ 0 0 0 0 0 0      % lower bound
                1 1 1 1 1 1 ]    % upper bound
  • Chrom = [ 0.23 0.17 0.95 0.38 0.06 0.26
              0.35 0.09 0.43 0.64 0.20 0.54
              0.50 0.10 0.09 0.65 0.68 0.46
              0.21 0.29 0.89 0.48 0.63 0.89 ]
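
For readers without the MATLAB GA Toolbox, the initialization described above can be approximated in Python. This is a sketch of the same idea, not a port of `crtrp`; the function name `create_real_population` and the seed are hypothetical.

```python
import random


def create_real_population(n_individuals, bounds, seed=None):
    """Random real-valued population; bounds is (lower_list, upper_list), like FieldDR."""
    rng = random.Random(seed)
    lower, upper = bounds
    return [
        [rng.uniform(lo, hi) for lo, hi in zip(lower, upper)]
        for _ in range(n_individuals)
    ]


# Six feature weights per individual, each bounded to [0, 1], and 200 individuals.
field_dr = ([0.0] * 6, [1.0] * 6)
chrom = create_real_population(200, field_dr, seed=0)
print(len(chrom), len(chrom[0]))
```

Each row is one individual, and each column is one feature weight drawn uniformly between its lower and upper bound.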

19
Results of using GA

20
GA Optimization Results

21
Feature importance

22
Contribution of the classification
  • A new approach to evaluating student usage of
    web-based instruction
  • An approach that is easily adaptable to different
    types of courses, different population sizes, and
    different attributes to be analyzed
  • Rigorous application of known classifiers as a
    means of analyzing and comparing use and
    performance of students who have taken a
    technical course that was partially/completely
    administered via the web

23
Clustering
  • Statement of problem
  • Classification (Predicting student performance)
  • Clustering (Ensembles of multiple clusterings)
  • B. Minaei-Bidgoli, A. Topchy and W.F. Punch,
    Ensembles of Partitions via Data Resampling,
    Proc. Intl. Conf. on Information Technology,
    ITCC/IEEE 2004, in press
  • B. Minaei-Bidgoli, A. Topchy and W.F. Punch,
    Effect of the Resampling Methods on Clustering
    Ensemble Efficacy, prepared to submit to Intl.
    Conf. on Machine Learning Models, Technologies
    and Applications, 2004
  • A. Topchy, B. Minaei-Bidgoli, A.K. Jain, W.F.
    Punch, Adaptive Clustering Ensembles, submitted
    to Intl. Conf. on Pattern Recognition, ICPR 2004
  • Proposed work
  • Tentative Schedule

24
Motivation
  • Combinations of classifiers have proved to be very
    effective in the supervised learning framework,
    e.g. bagging and boosting algorithms
  • In LON-CAPA, the course and student data are
    distributed
  • Distributed data mining requires efficient
    algorithms capable of integrating the solutions
    obtained from multiple sources of data and
    features
  • Ensembles of clusterings can provide novel,
    robust, and stable solutions

25
Taxonomy of Clustering Combination Approaches
26
Resampling Methods
  • Bootstrapping (sampling with replacement)
  • Create an artificial list by randomly drawing N
    elements from the original list of N elements.
    Some elements will be picked more than once.
  • Statistically, on average about 37% of the drawn
    elements are repeats (and about 37% of the
    original elements are never drawn)
  • Subsampling (sampling without replacement)
  • Control over the size of the subsample
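
The two resampling schemes can be compared with a short sketch. In a bootstrap sample of size N, a given element is missed with probability (1 - 1/N)^N, which approaches 1/e ≈ 0.368, so roughly 63% of the original elements appear at least once. The function names, the seed, and the sample sizes below are illustrative assumptions.

```python
import random


def bootstrap_sample(items, rng):
    """Draw len(items) elements with replacement."""
    return [rng.choice(items) for _ in items]


def subsample(items, size, rng):
    """Draw `size` distinct elements without replacement."""
    return rng.sample(items, size)


rng = random.Random(42)
data = list(range(10000))

boot = bootstrap_sample(data, rng)
distinct_fraction = len(set(boot)) / len(data)  # close to 1 - 1/e
sub = subsample(data, 2000, rng)

print(round(distinct_fraction, 3), len(sub))
```

Subsampling gives direct control over the sample size, whereas a bootstrap sample always has N draws but a random number of distinct elements.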

27
Related work on bootstrap partitioning
  • Estimate the number of clusters
  • (Jain & Moreau, 1987), (Fridlyand & Dudoit, 2001)
  • Clustering validity/reliability
  • (Jain & Moreau, 1987), (Fischer & Buhmann, 2003)
  • Find a measure for clustering stability
  • (Ben-Hur et al., 2002)
  • Clustering combination
  • (Fridlyand & Dudoit, 2001), (Fischer & Buhmann, 2003)
  • (Monti et al., 2003)

28
Experimental data sets

29
Two-spiral and Halfrings data sets
  • Halfrings: 400 patterns (100-300)
  • 2-Spirals: 200 patterns (100-100)

30
Bootstrap results on Iris

31
Subsampling on Halfrings

32
Subsampling results on Galaxy/Star

33
Error Rate for Individual Clustering

34
Summary of the best results of Bootstrap

35
(No Transcript)

36
Additional Proposed work
  • Statement of problem
  • Classification (Predicting student performance)
  • Clustering (Ensembles of multiple clusterings)
  • Additional Proposed work
  • Association Analysis
  • Dynamic mining
  • Tentative Schedule

37
A sequence-based clustering
  • The problem
  • Given students' browsing data and course
    contents, find clusters of learners with similar
    behavior
  • The order of browsed pages matters
  • P → P → R1 → R2 → P → A
  • R1 → R2 → P → A
  • P → A
  • P → P → P → P → P → P → P → P → P → A
  • P → P → P → P → P
  • R3 → R2 → P
  • Cluster students based on a similarity function
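
The slide does not say which similarity function is intended. One order-sensitive candidate is a normalized edit (Levenshtein) distance over page tokens, sketched below as an assumption; the function names are hypothetical and the sequences reuse the tokens shown on the slide.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            # Cheapest of deletion, insertion, and substitution/match.
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """1.0 for identical sequences, decreasing toward 0.0 as they diverge."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


s1 = ["P", "P", "R1", "R2", "P", "A"]
s2 = ["R1", "R2", "P", "A"]
s3 = ["P", "A"]

print(edit_distance(s1, s2), similarity(s2, s3))
```

The resulting pairwise similarities could feed any clustering method that accepts a similarity or distance matrix, such as hierarchical clustering.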

38
Web usage mining
  • Many techniques have been investigated in
    e-commerce and CRM
  • Some can be adapted, some cannot
  • The goals are different
  • The user model is different
  • Analyze students' interactions with LON-CAPA
    and take actions accordingly. This is similar
    to path traversal pattern mining, web
    sequential pattern mining, or web log mining.

39
Association Analysis postprocessing
  • Association rule mining studies the frequency of
    items occurring together in a given set of data.
  • Solving the discretization problem for continuous
    features
  • Post-analysis of the discovered knowledge in
    terms of interestingness, usefulness, and so
    on. What is useful or interesting is domain
    dependent, so we need to consult LON-CAPA
    instructors/authors
  • Strategic use of data and discovered knowledge

40
Dynamic mining: LON-CAPA Examples
  • You are about to start a test. Other students
    similar to you, who succeeded in this test, have
    also accessed Section 5 of Chapter 3. You did
    not. Would you like to access it now before
    attempting the test? Yes / No
  • Based on the time you took to solve the problem
    (Circular Motion), it seems that you are not
    thinking about the problem. It is better to see
    the following pages and then submit your
    answers: Motion in 2 Dimensions, Force and Motion,
    Momentum and Collisions
  • Someone answered the question you posted on the
    Bulletin Board yesterday. Would you like to read
    it now? Yes / No
  • The degree of difficulty of problem 3 in homework
    set 5 jumped to greater than 90 in the first
    5 hours of student access. There might be
    something wrong in the design of the problem.
    Would you like to revise it now? Yes / No

41
Conclusion: Enhancing web-based learning
  • LON-CAPA servers track students' activities in
    large logs
  • The knowledge discovered could be analyzed and
    evaluated by knowledge experts (off-line mining)
  • Integrated mining: the patterns discovered are
    fed back to the system, which would then
    seamlessly and transparently behave more
    intelligently.
  • We could pass the data through a magic box to
    find some obscure patterns.
  • Tools to recommend tasks, automatically adapt
    course materials
  • Tools can be personalized, manually or
    automatically

42
Tentative Schedule