Data Mining for an Educational Web-based System

1
Data Mining for an Educational Web-based System

Behrouz Minaei
Department of Computer Science and Engineering
Thesis Proposal
January 30th 2004

2
Topics

Statement of problem
Classification (Predicting student performance)
Clustering (Ensembles of multiple clusterings)
Additional Proposed work
Tentative Schedule

3
Statement of problem

Statement of problem
LON-CAPA
Data Mining
Data Preprocessing
Contributions
G. Albertelli, B. Minaei-Bidgoli, W.F. Punch, G.
Kortemeyer, and E. Kashy, Concept Feedback in
Computer-Assisted Assignments, Proceedings of
the (IEEE/ASEE) Frontiers in Education
Conference, 2002, Boston
M. Hall, J. Parker, B. Minaei-Bidgoli, G.
Albertelli, G. Kortemeyer, and E. Kashy,
Gathering and Timely Use of Feedback from
Individualized On-line Work with an Open-Source
CMS, submitted to (IEEE/ASEE) FIE 2004, Frontiers
in Education, Oct. 2004, Savannah
Classification (Predicting student performance)
Clustering (Ensembles of multiple clusterings)
Additional proposed work
Tentative Schedule

4
LON-CAPA

This research is a part of the latest online
educational system developed at Michigan State
University (MSU), the Learning Online Network
with Computer-Assisted Personalized Approach
(LON-CAPA).
Learning Content Management System
9 high schools, 2 community colleges, and 17
universities nationwide
Assessment System
Online assessment with immediate feedback and
multiple tries
Different students get different versions of the
same problem
Different options, graphs, images, numbers, or
formulas
Open-Source and Free (GPL, Runs on Linux)

5
LON-CAPA Data

Three kinds of growing data sets
Educational resources: web pages, demonstrations,
simulations, individualized problems, quizzes,
and examinations.
Information about users who create, modify,
assess, or use these resources.
Data about how students use and access the
educational materials

6
MSU Fall 2003

40 courses used LON-CAPA at MSU
Total student enrollment approximately 3,067 (out
of 13,400 total global student-users)
Disciplines included Advertising, Biochemistry,
Biology, Chemistry, Finance, Geology, Math,
Physics, Plant Biology, Statistics for Psychology

7
Statement of problem

LON-CAPA collects data on every single access to
its resources, in both the activity log and the
student database
Logs are not only huge but also distributed and
specific to a web-based educational system
(LON-CAPA)
Intelligent automated tools are needed to discover
relevant, useful, and interesting patterns
Apply the discovered rules to produce a more
intelligent system

8
Knowledge Discovery Process

Data integration: removing inconsistencies
Data cleansing: correcting errors and handling
missing values
Discretization: transforming continuous values
into categorical ones
Feature selection: keeping the more relevant
features
Mining process: rule discovery
Post-processing:
Large set of rules → simplify
1) More comprehensible, 2) More interesting
Use a combination of objective and subjective
approaches
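
The discretization step can be illustrated with a minimal sketch. It assumes equal-width binning, which is only one of several possible schemes and is not stated in the slides; the function name, the sample times, and the category labels are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Map each continuous value to a bin index in 0..n_bins-1 (equal-width bins)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so everything falls into one bin.
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical total times (in minutes) that five students spent on a problem.
times = [2.0, 5.5, 9.0, 14.0, 30.0]
labels = ["short", "medium", "long"]
categories = [labels[b] for b in equal_width_bins(times, 3)]
```

Equal-width bins are sensitive to outliers (here one long session pushes three students into the same bin), which is one reason the choice of discretization matters for the rules that are later discovered.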

9
Data Mining Tasks

Classification
The goal is to predict the class variable based
on the feature values of samples
Avoid overfitting
Clustering (unsupervised learning)
Association Analysis
Find the binary relationship among the data items
Any feature variable can occur both in antecedent
and in the consequent of a rule.

10
Contributions (1)

Our claim is that data mining can help to design
better and more intelligent web-based educational
environments

Can help instructors design courses more
effectively and detect anomalies
Can help students use the resources more
efficiently
11

Contributions (2)
12
Contributions (3)
Can find association rules among students'
educational activities
Can be used to identify those students who are at
risk, especially in very large classes
Can help instructors predict the approaches that
students will take for some types of problems
13
Predicting student performance

Statement of problem
Classification
Combination of Classifiers
Weighting the features
Using a Genetic Algorithm to find the best set of
weights
B. Minaei-Bidgoli, W.F. Punch, Using Genetic
Algorithms for Data Mining Optimization in an
Educational Web-based System, GECCO 2003,
pp. 2252-2263, July 2003, Chicago.
B. Minaei-Bidgoli, D.A. Kashy, G. Kortemeyer,
W.F. Punch, Predicting Student Performance: An
Application of Data Mining Methods with an
Educational Web-based System, (IEEE/ASEE) FIE
2003, Frontiers in Education, Nov. 2003, Boulder
Clustering (Ensembles of multiple clusterings)
Proposed work
Tentative Schedule

14
Data Set: PHY183 SS02

227 students
12 Homework sets
184 Problems
80 MB activity log
26 MB useful data
220,000 transactions
Extracted Features

Total number of correct answers. (Success rate)
Success at the first try
Number of attempts to get answer
Time spent until correct
Total time spent on the problem
Participating in the communication mechanisms
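
A minimal sketch of how some of these features could be derived from the activity log follows. The transaction layout (student, problem, seconds since the problem was first viewed, correctness flag) and all names are assumptions made for illustration; the actual LON-CAPA log format is not described in the slides.

```python
from collections import defaultdict

# Hypothetical transactions: (student, problem, seconds_since_first_view, is_correct)
log = [
    ("s1", "p1", 30, False),
    ("s1", "p1", 95, True),
    ("s1", "p2", 40, True),
    ("s2", "p1", 60, False),
    ("s2", "p1", 200, False),
]


def extract_features(transactions):
    """Aggregate per-student features from per-submission transactions."""
    by_pair = defaultdict(list)
    for student, problem, seconds, ok in transactions:
        by_pair[(student, problem)].append((seconds, ok))

    features = defaultdict(
        lambda: {"solved": 0, "first_try": 0, "attempts": 0, "time_to_correct": 0}
    )
    for (student, _problem), submissions in by_pair.items():
        submissions.sort()  # chronological order
        f = features[student]
        f["attempts"] += len(submissions)
        # Index of the first correct submission, if the problem was ever solved.
        first_ok = next((i for i, (_, ok) in enumerate(submissions) if ok), None)
        if first_ok is not None:
            f["solved"] += 1
            f["first_try"] += int(first_ok == 0)
            f["time_to_correct"] += submissions[first_ok][0]
    return dict(features)


features = extract_features(log)
```

Each student then becomes one feature vector, which is the form the classifiers on the following slides expect.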

15
Class Labels (3 possibilities)
2-Classes

3-Classes

9-Classes
16
Classifiers

Non-Tree Classifiers (Using MATLAB)
Bayesian Classifier
1NN
kNN
Multi-Layer Perceptron
Parzen Window
Combination of Multiple Classifiers (CMC)
Genetic Algorithm (GA), Optimizer
Decision Tree-Based Software
C5.0 (RuleQuest; successor to C4.5 and ID3)
CART (Salford Systems)
QUEST (Univ. of Wisconsin)
CRUISE (uses an unbiased variable selection
technique)

17
Fitness/Evaluation Function

5 classifiers
Multi-Layer Perceptron (2 minutes)
Bayesian Classifier
1NN
kNN
Parzen Window
CMC (3 seconds)
Divide data into training and test sets (10-fold
cross-validation)
Fitness function: performance achieved by the
classifier
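
As a rough illustration of a fitness function of this kind, the sketch below scores a vector of feature weights by the cross-validated accuracy of a 1NN classifier that uses a weighted squared Euclidean distance. The choice of 1NN as the single scored classifier, the fold assignment, and the toy data are assumptions; the slides only state that fitness is the performance achieved by the classifier.

```python
def weighted_dist(a, b, weights):
    """Squared Euclidean distance with one weight per feature."""
    return sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights))


def predict_1nn(train, x, weights):
    """Label of the training sample closest to x; train is a list of (features, label)."""
    return min(train, key=lambda item: weighted_dist(item[0], x, weights))[1]


def fitness(weights, data, k=10):
    """k-fold cross-validated accuracy of 1NN under the given feature weights."""
    k = min(k, len(data))  # fall back to leave-one-out on tiny data sets
    correct = 0
    for fold in range(k):
        test = data[fold::k]
        train = [item for i, item in enumerate(data) if i % k != fold]
        correct += sum(predict_1nn(train, x, weights) == y for x, y in test)
    return correct / len(data)


# Toy data: the first feature separates the classes, the second is noise.
data = [
    ((0.10, 0.90), "fail"), ((0.20, 0.10), "fail"), ((0.15, 0.50), "fail"),
    ((0.90, 0.95), "pass"), ((0.80, 0.05), "pass"), ((0.85, 0.50), "pass"),
]
informative_only = fitness((1.0, 0.0), data)
noise_only = fitness((0.0, 1.0), data)
```

A GA would call `fitness` once per individual, so weight vectors that emphasize informative features receive higher scores and are more likely to survive selection.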

18
Individual Representation

The GA Toolbox supports binary, integer, and
floating-point chromosome representations.
Chrom = crtrp(N, FieldDR) creates a random
real-valued matrix of size N x d, where N is the
number of individuals (200) and FieldDR is a
matrix of size 2 x d that contains the boundaries
of each variable of an individual.
FieldDR =  0 0 0 0 0 0   (lower bound)
           1 1 1 1 1 1   (upper bound)
Chrom   =  0.23 0.17 0.95 0.38 0.06 0.26
           0.35 0.09 0.43 0.64 0.20 0.54
           0.50 0.10 0.09 0.65 0.68 0.46
           0.21 0.29 0.89 0.48 0.63 0.89
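
For readers without the MATLAB toolbox, the initialization described on this slide can be approximated in Python as below. This is only a rough analogue of crtrp, not its implementation; the function name `create_real_population` and the fixed random seed are assumptions for the example.

```python
import random


def create_real_population(n_individuals, field_dr, rng):
    """Return n_individuals random real-valued chromosomes within the given bounds.

    field_dr is a pair (lower_bounds, upper_bounds), one entry per variable,
    mirroring the 2 x d FieldDR matrix on the slide.
    """
    lower, upper = field_dr
    return [
        [rng.uniform(lo, hi) for lo, hi in zip(lower, upper)]
        for _ in range(n_individuals)
    ]


# Six feature weights per individual, each bounded to [0, 1], 200 individuals.
field_dr = ([0.0] * 6, [1.0] * 6)
chrom = create_real_population(200, field_dr, random.Random(0))
```

Each row of `chrom` is one candidate weight vector, one weight per extracted feature.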

19
Results of using GA
20
GA Optimization Results
21
Features importance
22
Contribution of the classification

A new approach to evaluating student usage of
web-based instruction
An approach that is easily adaptable to different
types of courses, different population sizes, and
different attributes to be analyzed
Rigorous application of known classifiers as a
means of analyzing and comparing use and
performance of students who have taken a
technical course that was partially/completely
administered via the web

23
Clustering

Statement of problem
Classification (Predicting student performance)
Clustering (Ensembles of multiple clusterings)
B. Minaei-Bidgoli, A. Topchy, and W.F. Punch,
Ensembles of Partitions via Data Resampling,
Proc. Intl. Conf. on Information Technology,
ITCC/IEEE 2004, in press
B. Minaei-Bidgoli, A. Topchy, and W.F. Punch,
Effect of the Resampling Methods on Clustering
Ensemble Efficacy, prepared for submission to
Intl. Conf. on Machine Learning Models,
Technologies and Applications, 2004
A. Topchy, B. Minaei-Bidgoli, A.K. Jain, W.F.
Punch, Adaptive Clustering Ensembles, submitted
to Intl. Conf. on Pattern Recognition, ICPR 2004
Proposed work
Tentative Schedule

24
Motivation

Combinations of classifiers have proved very
effective in the supervised learning framework,
e.g. bagging and boosting algorithms
In LON-CAPA, the course and student data are
distributed
Distributed data mining requires efficient
algorithms capable of integrating the solutions
obtained from multiple sources of data and
features
Ensembles of clusterings can provide novel,
robust, and stable solutions
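
One common way to combine several partitions is a co-association matrix, sketched below. It is offered only as an illustration and is not necessarily the consensus function used in this work; the threshold of 0.5 and the three example partitions are assumptions.

```python
def co_association(partitions):
    """Fraction of partitions in which each pair of objects shares a cluster."""
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                counts[i][j] += int(labels[i] == labels[j])
    return [[c / len(partitions) for c in row] for row in counts]


def consensus_clusters(matrix, threshold=0.5):
    """Group objects linked by co-association above the threshold (connected components)."""
    n = len(matrix)
    labels = [None] * n
    next_label = 0
    for start in range(n):
        if labels[start] is not None:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] is None and matrix[i][j] > threshold:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels


# Three clusterings of five objects; cluster label names differ between runs.
partitions = [
    [0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1],
]
matrix = co_association(partitions)
consensus = consensus_clusters(matrix)
```

Because the matrix depends only on whether two objects share a cluster, it sidesteps the problem that cluster labels from different runs do not correspond to each other.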

25
Taxonomy of Clustering Combination Approaches
26
Resampling Methods

Bootstrapping (sampling with replacement)
Create an artificial list by randomly drawing N
elements from the original list of N elements.
Some elements will be picked more than once.
Statistically, on average about 37% of the drawn
elements are repeats
Subsampling (sampling without replacement)
Control over the size of the subsample
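
Both resampling schemes can be sketched with the Python standard library. The data set size, the subsample size, and the seed are arbitrary choices for the example. For large N, the fraction of bootstrap draws that repeat an already drawn element approaches 1/e, roughly 0.37.

```python
import random


def bootstrap(items, rng):
    """Draw len(items) elements with replacement."""
    return [rng.choice(items) for _ in items]


def subsample(items, size, rng):
    """Draw `size` distinct elements without replacement."""
    return rng.sample(items, size)


rng = random.Random(42)
data = list(range(1000))

boot = bootstrap(data, rng)
# Share of draws that duplicate an element already present in the sample.
repeat_fraction = 1 - len(set(boot)) / len(boot)

sub = subsample(data, 300, rng)
```

Subsampling lets the sample size be chosen freely, which is the control the slide refers to, whereas a bootstrap sample always has the size of the original data set.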

27
Related work on bootstrap partitioning

Estimate the number of clusters
(Jain & Moreau, 1987), (Fridlyand & Dudoit, 2001)
Clustering validity/reliability
(Jain & Moreau, 1987), (Fischer & Buhmann, 2003)
Find a measure for clustering stability
(Ben-Hur et al., 2002)
Clustering combination
(Fridlyand & Dudoit, 2001), (Fischer & Buhmann, 2003),
(Monti et al., 2003)

28
Experiment Data sets
29
Two-spiral and Halfrings data sets
Halfrings: 400 patterns (100-300)
2-Spirals: 200 patterns (100-100)
30
Bootstrap results on Iris
31
Subsampling on Halfrings
32
Subsampling results on Galaxy/Star
33
Error Rate for Individual Clustering
34
Summary of the best results of Bootstrap
36
Additional Proposed work

Statement of problem
Classification (Predicting student performance)
Clustering (Ensembles of multiple clusterings)
Additional Proposed work
Association Analysis
Dynamic mining
Tentative Schedule

37
A sequence-based clustering

The problem:
Given students' browsing data and course
contents, find clusters of learners with similar
behavior
The order of browsed pages matters
P → P → R1 → R2 → P → A
R1 → R2 → P → A
P → A
P → P → P → P → P → P → P → P → P → A
P → P → P → P → P
R3 → R2 → P
Cluster students based on a similarity function
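
The slide does not specify the similarity function. As one possible choice, the sketch below uses the length of the longest common subsequence (LCS), which respects page order, normalized by the longer sequence. The function names and the normalization are assumptions for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two page sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Order-aware similarity in [0, 1]; 1.0 means identical sequences."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))


# Three of the browsing sequences from the slide.
s1 = ["P", "P", "R1", "R2", "P", "A"]
s2 = ["R1", "R2", "P", "A"]
s3 = ["P", "P", "P", "P", "P"]

sim_12 = similarity(s1, s2)
sim_13 = similarity(s1, s3)
sim_23 = similarity(s2, s3)
```

Under this measure the first two students, who both read R1 and R2 before answering, come out closer to each other than either is to the student who only revisits the problem page.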

38
Web usage mining

Many techniques have been investigated in
e-commerce and CRM
Some can be adapted, some cannot
The goals are different
The user model is different
Analyze students' interactions with LON-CAPA and
take actions accordingly. This is similar to path
traversal pattern mining, web sequential pattern
mining, or web log mining.

39
Association Analysis postprocessing

Association rule mining studies the frequency of
items occurring together in a given set of data.
Solving the discretization problem for continuous
features
Post-analysis of the discovered knowledge in
terms of interestingness, usefulness, and so on.
What is useful or interesting is domain
dependent, so we need to talk to LON-CAPA
instructors/authors
Strategic use of data and discovered knowledge
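
Support and confidence are the standard objective measures for ranking association rules, and a minimal sketch of both follows. The item names and the four transactions are invented for illustration and do not come from LON-CAPA data.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent); the antecedent must occur at least once."""
    joint = support(transactions, set(antecedent) | set(consequent))
    return joint / support(transactions, antecedent)


# Hypothetical discretized student records, one set of items per student.
transactions = [
    {"first_try_success", "high_grade"},
    {"first_try_success", "used_discussion", "high_grade"},
    {"many_attempts", "low_grade"},
    {"first_try_success", "low_grade"},
]

rule_support = support(transactions, {"first_try_success", "high_grade"})
rule_confidence = confidence(transactions, {"first_try_success"}, {"high_grade"})
```

These objective scores only filter the rule set; whether a surviving rule is actually interesting remains a subjective, domain-dependent judgment.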

40
Dynamic mining: LON-CAPA Examples

You are about to start a test. Other students
similar to you, who succeeded in this test, have
also accessed Section 5 of Chapter 3. You did
not. Would you like to access it now before
attempting the test? Yes / No

Based on the time you spent solving the problem
(Circular Motion), it seems that you are not
thinking about the problem. It is better to see
the following pages and then submit your
answers: Motion in 2 Dimensions; Force and Motion;
Momentum and Collisions

Someone answered the question you posted on the
Bulletin Board yesterday. Would you like to read
it now? Yes / No

The degree of difficulty of problem 3 in homework
set 5 jumped to greater than 90% in the first
5 hours of student access. There might be
something wrong in the design of the problem.
Would you like to revise it now? Yes / No
41
Conclusion: Enhancing web-based learning

LON-CAPA servers track students' activities in
large logs
The knowledge discovered could be analyzed and
evaluated by knowledge experts (off-line mining)
Integrated mining: the patterns discovered are
fed back into the system, which seamlessly and
transparently makes it behave intelligently.
We could pass the data through a magic box to
find some obscure patterns.
Tools to recommend tasks and automatically adapt
course materials
Tools can be personalized, manually or
automatically