Title: Data Mining for an Educational Web-based System
1 Data Mining for an Educational Web-based System
- Behrouz Minaei
- Department of Computer Science and Engineering
- Thesis Proposal
- January 30th 2004
2 Topics
- Statement of problem
- Classification (Predicting student performance)
- Clustering (Ensembles of multiple clusterings)
- Additional Proposed work
- Tentative Schedule
3 Statement of problem
- Statement of problem
- LON-CAPA
- Data Mining
- Data Preprocessing
- Contributions
- G. Albertelli, B. Minaei-Bidgoli, W.F. Punch, G. Kortemeyer, and E. Kashy, Concept Feedback in Computer-Assisted Assignments, Proceedings of the (IEEE/ASEE) Frontiers in Education Conference, 2002, Boston
- M. Hall, J. Parker, B. Minaei-Bidgoli, G. Albertelli, G. Kortemeyer, and E. Kashy, Gathering and Timely Use of Feedback from Individualized On-line Work with an Open-Source CMS, submitted to (IEEE/ASEE) FIE 2004 Frontiers in Education, Oct. 2004, Savannah
- Classification (Predicting student performance)
- Clustering (Ensembles of multiple clusterings)
- Additional proposed work
- Tentative Schedule
4 LON-CAPA
- This research is a part of the latest online educational system developed at Michigan State University (MSU), the Learning Online Network with Computer-Assisted Personalized Approach (LON-CAPA).
- Learning Content Management System
- 9 high schools, 2 community colleges, and 17 universities nationwide
- Assessment System
- Online assessment with immediate feedback and multiple tries
- Different students get different versions of the same problem
- Different options, graphs, images, numbers, or formulas
- Open-Source and Free (GPL, runs on Linux)
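One way to picture "different students get different versions of the same problem" is a random generator seeded from the student and problem identifiers. The sketch below is only an illustration under that assumption; it is not LON-CAPA's actual mechanism, and the function name, identifiers, and problem parameters are all hypothetical.

```python
import hashlib
import random


def problem_variant(student_id: str, problem_id: str) -> dict:
    """Return a reproducible, student-specific set of numbers for one problem."""
    # Derive a stable seed from the (student, problem) pair. hashlib is used
    # because Python's built-in hash() of strings changes between runs.
    digest = hashlib.sha256(f"{student_id}:{problem_id}".encode()).hexdigest()
    rng = random.Random(int(digest, 16))
    # Hypothetical parameters for a "force = mass * acceleration" problem.
    mass_kg = rng.randint(2, 20)
    accel_m_s2 = rng.randint(1, 9)
    return {
        "mass_kg": mass_kg,
        "accel_m_s2": accel_m_s2,
        "answer_N": mass_kg * accel_m_s2,
    }
```

Because the seed depends only on the two identifiers, the same student sees the same numbers on every try, while other students usually see different ones.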
5 LON-CAPA Data
- Three kinds of growing data sets
- Educational resources: web pages, demonstrations, simulations, individualized problems, quizzes, and examinations.
- Information about users who create, modify, assess, or use these resources.
- Data about how students use and access the educational materials
6 MSU Fall 2003
- 40 courses used LON-CAPA at MSU
- Total student enrollment approximately 3,067 (out of 13,400 total global student-users)
- Disciplines included Advertising, Biochemistry, Biology, Chemistry, Finance, Geology, Math, Physics, Plant Biology, Statistics for Psychology
7 Statement of problem
- LON-CAPA collects data for every single access to the resources in both the activity log and the student database
- Logs are not only huge but also distributed and specific to a web-based educational system (LON-CAPA)
- Intelligent automated tools are needed to discover relevant, useful, and interesting patterns
- Apply the discovered rules to produce a more intelligent system
8 Knowledge Discovery Process
- Data Integration: removing inconsistencies
- Data Cleansing: correcting errors, handling missing values
- Discretization: transforming continuous values into categorical ones
- Feature Selection: keeping the features that are more relevant
- Mining process: rule discovery
- Post-processing
- Large set of rules -> simplify
- 1) More comprehensible, 2) More interesting
- Use a combination of objective and subjective approaches
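The discretization step can be sketched with equal-width binning, which is one common choice and not necessarily the one used in this work. The function name and the sample values are assumptions made for illustration.

```python
def equal_width_bins(values, n_bins, labels=None):
    """Map continuous values to bin indices (or labels) of equal width."""
    lo, hi = min(values), max(values)
    # Fall back to width 1.0 when all values are identical, to avoid dividing by zero.
    width = (hi - lo) / n_bins or 1.0
    result = []
    for v in values:
        # The maximum value would land in bin n_bins, so clamp it into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        result.append(labels[idx] if labels else idx)
    return result


# Hypothetical "time spent on a problem" values, in seconds.
times = [30, 120, 600, 45, 300]
categories = equal_width_bins(times, 3, labels=["short", "medium", "long"])
```

Equal-width bins are simple but sensitive to outliers; equal-frequency bins are a usual alternative when the values are heavily skewed.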
9 Data Mining Tasks
- Classification
- The goal is to predict the class variable based on the feature values of samples. Avoid overfitting.
- Clustering (unsupervised learning)
- Association Analysis
- Find the binary relationship among the data items
- Any feature variable can occur both in the antecedent and in the consequent of a rule.
10 Contributions (1)
- Our claim is that data mining can help to design a better and more intelligent educational web-based environment
- Can help instructors design the course more effectively and detect anomalies
- Can help students use the resources more efficiently
11 Contributions (2)
12 Contributions (3)
- Can find associative rules among students' educational activities
- Can be used to identify students who are at risk, especially in very large classes
- Can help instructors predict the approaches that students will take for some types of problems
13 Predicting student performance
- Statement of problem
- Classification
- Combination of Classifiers
- Weighting the features
- Using a Genetic Algorithm to find the best set of weights
- B. Minaei-Bidgoli, W.F. Punch, Using Genetic Algorithms for Data Mining Optimization in an Educational Web-based System, GECCO 2003, 2252-2263, July 2003, Chicago.
- B. Minaei-Bidgoli, D.A. Kashy, G. Kortemeyer, W.F. Punch, Predicting Student Performance: An Application of Data Mining Methods with an Educational Web-based System, (IEEE/ASEE) FIE 2003 Frontiers in Education, Nov. 2003, Boulder
- Clustering (Ensembles of multiple clusterings)
- Proposed work
- Tentative Schedule
14 Data Set: PHY183 SS02
- 227 students
- 12 Homework sets
- 184 Problems
- 80 MB activity log
- 26 MB useful data
- 220,000 transactions
- Extracted Features
- Total number of correct answers. (Success rate)
- Success at the first try
- Number of attempts to get answer
- Time spent until correct
- Total time spent on the problem
- Participating in the communication mechanisms
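As a rough illustration of how some of the extracted features could be computed from an activity log, the sketch below counts solved problems, first-try successes, and attempts per student. The log layout (student, problem, correct) is an assumption; the real LON-CAPA log format is not shown in these slides.

```python
from collections import defaultdict


def extract_features(transactions):
    """Per-student counts from (student, problem, correct) tuples in time order."""
    attempts = defaultdict(int)  # (student, problem) -> tries up to the first correct one
    solved = set()               # (student, problem) pairs answered correctly
    first_try = set()            # pairs answered correctly on the first submission
    for student, problem, correct in transactions:
        key = (student, problem)
        if key in solved:
            continue             # ignore submissions made after the problem is solved
        attempts[key] += 1
        if correct:
            solved.add(key)
            if attempts[key] == 1:
                first_try.add(key)

    features = defaultdict(lambda: {"solved": 0, "first_try": 0, "attempts": 0})
    for (student, _), n in attempts.items():
        features[student]["attempts"] += n
    for student, _ in solved:
        features[student]["solved"] += 1
    for student, _ in first_try:
        features[student]["first_try"] += 1
    return dict(features)


# Hypothetical log entries, in chronological order.
log = [
    ("s1", "p1", False),
    ("s1", "p1", True),
    ("s1", "p2", True),
    ("s2", "p1", False),
]
per_student = extract_features(log)
```

Time-based features (time until correct, total time on a problem) would additionally need timestamps and a rule for splitting sessions, which this sketch leaves out.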
15 Class Labels (3 possibilities)
- 2 classes
- 3 classes
- 9 classes
16 Classifiers
- Non-Tree Classifiers (Using MATLAB)
- Bayesian Classifier
- 1NN
- kNN
- Multi-Layer Perceptron
- Parzen Window
- Combination of Multiple Classifiers (CMC)
- Genetic Algorithm (GA), Optimizer
- Decision Tree-Based Software
- C5.0 (RuleQuest << C4.5 << ID3)
- CART (Salford Systems)
- QUEST (Univ. of Wisconsin)
- CRUISE (uses an unbiased variable selection technique)
17 Fitness/Evaluation Function
- 5 classifiers
- Multi-Layer Perceptron: 2 minutes
- Bayesian Classifier
- 1NN
- kNN
- Parzen Window
- CMC: 3 seconds
- Divide data into training and test sets (10-fold cross-validation)
- Fitness function: performance achieved by the classifier
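A fitness function of this kind can be sketched as the cross-validated accuracy obtained with a given feature-weight vector. The sketch below uses a weighted 1NN classifier as a stand-in for the classifiers listed above; the function names and the toy data are assumptions, and the actual work evaluates its own classifiers in MATLAB.

```python
def weighted_1nn_predict(train, weights, x):
    """Label of the training sample nearest to x under a weighted squared distance."""
    def dist(a):
        return sum(w * (ai - xi) ** 2 for w, ai, xi in zip(weights, a, x))
    return min(train, key=lambda sample: dist(sample[0]))[1]


def fitness(weights, data, k=10):
    """Accuracy of weighted 1NN over k folds; data is a list of (features, label)."""
    correct = 0
    for fold in range(k):
        test = data[fold::k]  # every k-th sample, starting at `fold`, forms one fold
        train = [s for i, s in enumerate(data) if i % k != fold]
        correct += sum(weighted_1nn_predict(train, weights, x) == y for x, y in test)
    return correct / len(data)


# Toy data: feature 0 separates the classes, feature 1 is misleading noise.
toy = [
    ((0.0, 5.0), "fail"),
    ((0.1, 0.0), "fail"),
    ((1.0, 5.1), "pass"),
    ((0.9, 0.1), "pass"),
]
good = fitness((1.0, 0.0), toy, k=4)  # weight only the informative feature
bad = fitness((0.0, 1.0), toy, k=4)   # weight only the noisy feature
```

A genetic algorithm would call `fitness` on each candidate weight vector and keep the vectors that score highest.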
18 Individual Representation
- The GA Toolbox supports binary, integer, and floating-point chromosome representations.
- Chrom = crtrp(N, FieldDR) creates a random real-valued matrix of size N x d, where N is the number of individuals (200) and FieldDR is a matrix of size 2 x d that contains the boundaries of each variable of an individual.
- FieldDR =
  0 0 0 0 0 0 (lower bound)
  1 1 1 1 1 1 (upper bound)
- Chrom =
  0.23 0.17 0.95 0.38 0.06 0.26
  0.35 0.09 0.43 0.64 0.20 0.54
  0.50 0.10 0.09 0.65 0.68 0.46
  0.21 0.29 0.89 0.48 0.63 0.89
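For readers without MATLAB, the population initialization described above can be approximated in Python as follows. This is an analogue written for illustration, not the GA Toolbox routine itself; the function name and the seed are assumptions.

```python
import random


def create_real_population(n_individuals, bounds, seed=None):
    """Random real-valued population of shape n_individuals x d.

    `bounds` is a (lower, upper) pair of length-d lists, mirroring the
    2 x d FieldDR matrix described on the slide.
    """
    lower, upper = bounds
    rng = random.Random(seed)
    return [
        [rng.uniform(lo, hi) for lo, hi in zip(lower, upper)]
        for _ in range(n_individuals)
    ]


# Six feature weights per individual, each between 0 and 1, for 200 individuals.
field_dr = ([0.0] * 6, [1.0] * 6)
chrom = create_real_population(200, field_dr, seed=0)
```

Each row of `chrom` is one candidate set of feature weights that the GA then evaluates and evolves.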
19 Results of using GA
20 GA Optimization Results
21 Feature importance
22 Contribution of the classification
- A new approach to evaluating student usage of web-based instruction
- An approach that is easily adaptable to different types of courses, different population sizes, and different attributes to be analyzed
- Rigorous application of known classifiers as a means of analyzing and comparing use and performance of students who have taken a technical course that was partially/completely administered via the web
23 Clustering
- Statement of problem
- Classification (Predicting student performance)
- Clustering (Ensembles of multiple clusterings)
- B. Minaei-Bidgoli, A. Topchy and W.F. Punch, Ensembles of Partitions via Data Resampling, Proc. Intl. Conf. on Information Technology, ITCC/IEEE 2004, in press
- B. Minaei-Bidgoli, A. Topchy and W.F. Punch, Effect of the Resampling Methods on Clustering Ensemble Efficacy, prepared for submission to Intl. Conf. on Machine Learning Models, Technologies and Applications, 2004
- A. Topchy, B. Minaei-Bidgoli, A.K. Jain, W.F. Punch, Adaptive Clustering Ensembles, submitted to Intl. Conf. on Pattern Recognition, ICPR 2004
- Proposed work
- Tentative Schedule
24 Motivation
- Combinations of classifiers have proved to be very effective in the supervised learning framework, e.g. bagging and boosting algorithms
- In LON-CAPA, the course and student data are distributed
- Distributed data mining requires efficient algorithms capable of integrating the solutions obtained from multiple sources of data and features
- Ensembles of clusterings can provide novel, robust, and stable solutions
25 Taxonomy of Clustering Combination Approaches
26 Resampling Methods
- Bootstrapping (Sampling with replacement)
- Create an artificial list by randomly drawing N elements from the original list of N elements. Some elements will be picked more than once.
- Statistically, on average about 37% of the draws are repeats (so about 37% of the original elements are left out)
- Subsampling (Sampling without replacement)
- Control over the size of the subsample
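The roughly 37% figure for bootstrapping follows from the chance that a given element is never drawn in N tries, (1 - 1/N)^N, which approaches 1/e (about 0.368) as N grows. The sketch below checks this empirically and contrasts it with subsampling; the function names, the seed, and the sample sizes are assumptions chosen for illustration.

```python
import random


def bootstrap_indices(n, rng):
    """Draw n indices from range(n) with replacement."""
    return [rng.randrange(n) for _ in range(n)]


def subsample_indices(n, size, rng):
    """Draw `size` distinct indices from range(n), i.e. without replacement."""
    return rng.sample(range(n), size)


rng = random.Random(0)
n = 10_000

boot = bootstrap_indices(n, rng)
# Share of draws that repeat an already drawn element; this equals the
# share of original elements that never appear in the bootstrap sample.
repeat_fraction = 1 - len(set(boot)) / n

sub = subsample_indices(n, 8_000, rng)  # subsample size is chosen freely
```

With N = 10,000 the measured `repeat_fraction` should sit close to 0.368, while the subsample contains no repeats at all.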
27 Related work on bootstrap partitioning
- Estimate the number of clusters
- (Jain & Moreau 1987), (Fridlyand & Dudoit 2001)
- Clustering validity/reliability
- (Jain & Moreau 1987), (Fischer & Buhmann 2003)
- Find a measure for clustering stability
- (Ben-Hur et al., 2002)
- Clustering combination
- (Fridlyand & Dudoit 2001), (Fischer & Buhmann 2003)
- (Monti et al., 2003)
28 Experiment Data sets
29 Two-spiral and Halfrings data sets
- Halfrings: 400 patterns (100-300)
- 2-Spirals: 200 patterns (100-100)
30 Bootstrap results on Iris
31 Subsampling on Halfrings
32 Subsampling results on Galaxy/Star
33 Error Rate for Individual Clustering
34 Summary of the best results of Bootstrap
36 Additional Proposed work
- Statement of problem
- Classification (Predicting student performance)
- Clustering (Ensembles of multiple clusterings)
- Additional Proposed work
- Association Analysis
- Dynamic mining
- Tentative Schedule
37 A sequence-based clustering
- The problem
- Given students' browsing data and course contents, find clusters of learners with similar behavior
- The order of browsed pages matters
- P -> P -> R1 -> R2 -> P -> A
- R1 -> R2 -> P -> A
- P -> A
- P -> P -> P -> P -> P -> P -> P -> P -> P -> A
- P -> P -> P -> P -> P
- R3 -> R2 -> P
- Cluster students based on a similarity function
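One possible order-aware similarity function is the length of the longest common subsequence (LCS) of two browsing sequences, normalized by the longer sequence. This is only a candidate sketched for illustration; the slides do not fix a particular similarity function, and the names below are assumptions.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def similarity(a, b):
    """LCS length divided by the longer sequence length, in [0, 1]."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))


# Three of the browsing sequences listed above, treated as opaque page symbols.
s1 = ["P", "P", "R1", "R2", "P", "A"]
s2 = ["R1", "R2", "P", "A"]
s3 = ["P", "A"]
pairwise = [similarity(s1, s2), similarity(s2, s3), similarity(s1, s3)]
```

The resulting pairwise similarities could then be handed to any clustering method that works on a similarity or distance matrix.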
38 Web usage mining
- Many techniques have been investigated in e-commerce and CRM
- Some can be adapted, some cannot
- The goals are different
- The user model is different
- Analyze students' interactions with LON-CAPA and take actions accordingly. This is similar to path traversal pattern mining, web sequential pattern mining, or web log mining.
39 Association Analysis: postprocessing
- Association rule mining studies the frequency of items occurring together in a given set of data.
- Solving the discretization problem for continuous features
- Post-analysis of the discovered knowledge in terms of interestingness, usefulness, and so on. What is useful or interesting is domain dependent, so we need to talk to LON-CAPA instructors/authors
- Strategic use of data and discovered knowledge
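The two standard objective measures for an association rule, support and confidence, can be sketched as follows. The student records and item names are invented for illustration and are not taken from LON-CAPA data.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent): support(A and C) / support(A)."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the rule never fires, so report zero confidence
    return support(transactions, set(antecedent) | set(consequent)) / base


# Hypothetical discretized records, one set of items per student.
records = [
    {"first_try=yes", "grade=pass"},
    {"first_try=yes", "grade=pass"},
    {"first_try=no", "grade=fail"},
    {"first_try=yes", "grade=fail"},
]
rule_support = support(records, {"first_try=yes", "grade=pass"})
rule_confidence = confidence(records, {"first_try=yes"}, {"grade=pass"})
```

Support and confidence are objective measures only; whether a rule is interesting to instructors remains a subjective, domain-dependent question, as noted above.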
40 Dynamic mining: LON-CAPA Examples
- You are about to start a test. Other students similar to you, who succeeded in this test, have also accessed Section 5 of Chapter 3. You did not. Would you like to access it now before attempting the test? Yes / No
- Based on the time you have spent on the problem (Circular Motion), it seems that you are not thinking about the problem. It is better to see the following pages and then submit your answers: Motion in 2 Dimensions, Force and Motion, Momentum and Collisions
- Someone answered the question you posted on the Bulletin Board yesterday. Would you like to read it now? Yes / No
- The degree of difficulty of problem 3 in homework set 5 jumped to greater than 90% in the first 5 hours of student access. There might be something wrong in the design of the problem. Would you like to revise it now? Yes / No
41 Conclusion: Enhancing web-based learning
- LON-CAPA servers track students' activities in large logs
- The knowledge discovered could be analyzed and evaluated by knowledge experts (off-line mining)
- Integrated mining: the patterns discovered are fed back to a system that would seamlessly and transparently make it behave intelligently.
- We could pass the data through a magic box to find some obscure patterns.
- Tools to recommend tasks and automatically adapt course materials
- Tools can be personalized, manually or automatically
42 Tentative Schedule