Title: Classroom Data Analysis
1. Workflow Analysis of Student Data
John R. Gilbert and Viral Shah
2. Goals and Caveats
- Our goal is to develop methods for using measured data to evaluate hypotheses about workflow.
- These are preliminary experiments using data from one pilot classroom study (UCSB CS240A, spring 2004).
- Don't put too much trust in this data! (We are still learning what data to gather and how.)
- Therefore, this talk won't defend any particular conclusions about workflows. Rather, we aim to:
  - show that a data-based analytical approach is promising for further development
  - inform data-gathering in upcoming studies.
3. Background
- The system asked students for a reason for each compile
- We didn't trust the answers . . .
- But we captured the full source, etc., at each compile/run
- So, we completely re-ran the student experience
  - here, one assignment from one class
  - 17 student histories
  - about one day of 32-processor cluster time
- Used heuristics to assign reasons for compiles
4. Scripted questionnaire
- What is the reason for this compile/run?
- Learn / experiment with compiler
- Adding serial functionality
- Parallelizing
- Performance tuning
- Fixing compile time error
- Fixing run time error
5. Heuristics to deduce answers
- What is the reason for this compile/run?
  - Learn / experiment with compiler: few MPI calls, LOC unchanged
  - Adding serial functionality: few MPI calls, LOC changes
  - Parallelizing: number of MPI calls changes
  - Performance tuning: correct, run time changes
  - Fixing compile-time error: previous compile failed
  - Fixing run-time error: previous run failed on random test case
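The heuristics above can be expressed as a short decision procedure. The sketch below is hypothetical: the snapshot field names, the "few MPI calls" threshold, and the rule ordering are our assumptions for illustration, not the study's actual classifier.

```python
def classify_compile(prev, curr):
    """Guess why a student compiled, from two consecutive snapshots.

    Each snapshot is a dict with (assumed) keys:
      mpi_calls - number of MPI call sites in the source
      loc       - lines of code
      compiled  - did this snapshot's compile succeed?
      run_ok    - did the run pass the random test case? (None if no run)
      run_time  - wall-clock run time (None if it did not run)
    """
    # Error-fixing rules look at the previous snapshot's outcome first.
    if not prev["compiled"]:
        return "fixing compile-time error"
    if prev["run_ok"] is False:
        return "fixing run-time error"
    # A change in the number of MPI calls signals parallelization work.
    if curr["mpi_calls"] != prev["mpi_calls"]:
        return "parallelizing"
    # "Few MPI calls" threshold is an assumption (here: at most 2).
    if curr["mpi_calls"] <= 2 and curr["loc"] == prev["loc"]:
        return "learn / experiment with compiler"
    if curr["mpi_calls"] <= 2:
        return "adding serial functionality"
    # Otherwise: previous version was correct, so tuning is the best guess.
    return "performance tuning"
```

Ordering matters: error-fixing takes priority, since a failed previous compile or run explains the new compile regardless of what else changed.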
6. Example of student trace
- Student 2: 70 runs in 7 days, 12 hr, 0 sec
  - rev 1 (after 0s): seq, runerr
  - rev 2 (after 4d 4h 27m 7s): seq par, cplerr
  - rev 3 (after 41s): seq, cplerr
  - rev 4 (after 19s): cplerr
  - rev 5 (after 21s): cplerr
  - . . .
  - rev 14 (after 21m 49s): cplerr
  - rev 15 (after 6m 32s): seq, cplerr
  - rev 16 (after 38s): seq, run, time 7.47479
  - rev 17 (after 57s): seq, run, time 4.58393
  - rev 18 (after 2m 25s): run, time 7.68748
  - rev 19 (after 1m 14s): run, time 4.54306
  - . . .
  - rev 64 (after 2m 22s): par, runerr
  - rev 65 (after 5m 0s): par, crash
  - rev 66 (after 6m 43s): run, time 10.6737
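A trace like this is the raw material for fitting: counting transitions and averaging dwell times gives empirical estimates of prob(B | A) and time(A, B). A minimal sketch, using a made-up toy trace (not Student 2's actual revisions):

```python
from collections import defaultdict

# Toy trace of (state, dwell_seconds) events ending in "done".
# The states and times here are invented for illustration.
trace = [("compile", 41), ("compile", 19), ("test", 38),
         ("debug", 57), ("compile", 25), ("test", 74), ("done", 0)]

counts = defaultdict(lambda: defaultdict(int))     # counts[a][b] = # of a->b
dwells = defaultdict(lambda: defaultdict(list))    # dwells[a][b] = dwell samples

# Each consecutive pair (a, b) in the trace is one observed transition;
# the dwell recorded with a is the time spent in a before moving to b.
for (a, t), (b, _) in zip(trace, trace[1:]):
    counts[a][b] += 1
    dwells[a][b].append(t)

for a in counts:
    total = sum(counts[a].values())
    for b in counts[a]:
        p = counts[a][b] / total
        mean = sum(dwells[a][b]) / len(dwells[a][b])
        print(f"{a} -> {b}: prob {p:.2f}, mean dwell {mean:.0f}s")
```

These per-edge relative frequencies and sample means are the natural point estimates for the model parameters introduced on the timed-Markov-process slide below.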
7. Summary of student experience
- Student 1: 324 runs in 22h 41m 37s, best 3.386
- Student 2: 70 runs in 8h 35m 11s, best 2.346
- Student 3: 36 runs in 4h 59m 54s, best 1.854
- Student 4: 216 runs in 11h 42m 54s, best 6.509
- Student 5: 173 runs in 14h 57m 35s, best 5.459
- Student 6: 122 runs in 8h 27m 10s, best 2.606
- Student 7: 174 runs in 18h 17m 8s, best 3.576
- Student 8: 536 runs in 27h 10m 16s, best 7.540
- Student 9: 72 runs in 17h 23m 32s, best 5.533
- Student 10: 110 runs in 11h 43m 29s, best 5.183
- Student 11: 325 runs in 41h 18m 12s, best 1.684
- Student 12: 188 runs in 18h 24m 39s, best 4.428
8. Compiles, Runs, Correct runs
- Distribution varies significantly by programmer
9. LOC profiles
- All kinds of workflows are observed in the class
10. Timed Markov processes
- A timed Markov process is a Markov process with associated state dwell times
- The dwell times depend on how the state is exited:
  - prob(B | A) is the probability that the next state is B, given that the current state is A
  - time(A, B) is the dwell time spent in state A, given that the next state is B
  - The dwell time is a random variable in general
From Burton Smith and David Mizell, CRAY
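This definition fits in a few lines of code. Below is a minimal simulator sketch; the states, probabilities, and dwell times are illustrative, not the fitted classroom values, and the exponential dwell distribution is our modeling assumption (the definition only says the dwell is some random variable).

```python
import random

# Each entry maps a state to its outgoing edges:
#   next_state -> (transition probability, mean dwell time in seconds).
# Illustrative numbers only.
TMP = {
    "compile": {"test": (1.0, 49.0)},
    "test":    {"debug": (0.7, 5.0), "done": (0.3, 5.0)},
    "debug":   {"compile": (1.0, 380.0)},
}

def simulate(tmp, state="compile", rng=random):
    """Walk the timed Markov process until 'done'; return total elapsed time."""
    total = 0.0
    while state != "done":
        edges = tmp[state]
        r = rng.random()
        for nxt, (p, dwell) in edges.items():
            r -= p
            if r <= 0:
                # Dwell time is a random variable; here we assume it is
                # exponentially distributed with the given mean.
                total += rng.expovariate(1.0 / dwell)
                state = nxt
                break
    return total
```

Running `simulate(TMP)` many times gives the distribution of total completion time implied by the model, which can then be compared against the observed student totals.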
11. General model of researcher workflow
- express new scientific theory in informal notation
- express new theory in VHLL (Matlab, Perl, OS360 JCL, etc.)
- debug code on small data sets
- test new theory on small, medium data sets; compare against known results, previous models
- redesign program for HPC system
- write program for HPC system
- compile for debug
- test HPC code
- debug HPC code
- select medium-to-large data set for performance testing/optimization
- optimize HPC code for performance
- do performance test run for HPC code
- select, obtain large-scale data set
- structure data set for large-scale computation
- test large-scale data set arrangement for correctness
- test for expected performance
- design/implement visualization approach for larger-scale problems
- run performance-tuned version against large-scale data set
- visualize results
Items in red: a new system could shorten these times; we'll try to model them.
Items in orange: could also be sped up by a new system, but not yet part of the model.
From Burton Smith and David Mizell, CRAY
12. Researcher workflow model
From Burton Smith and David Mizell, CRAY
13. Fitting data to Cray model
- Transition probabilities (percent, current state to next state):
  - program → compile1: 100.0
  - compile1 → test: 100.0
  - test → program: 23.7; → debug: 71.3; → run: 4.8; → done: 0.2
  - debug → compile1: 100.0
  - run → debug: 26.6; → optimize: 69.9; → done: 3.5
  - optimize → compile2: 100.0
  - compile2 → run: 100.0
- Average dwell times (time in the current state, given the next state):
  - program → compile1: 4m 28s
  - compile1 → test: 49s
  - test → program: 5s; → debug: 5s; → run: 9s; → done: 5s
  - debug → compile1: 6m 20s
  - run → debug: 5s; → optimize: 4s; → done: 3s
  - optimize → compile2: 10m 29s
  - compile2 → run: 30s
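One thing the fitted model immediately yields is an expected time to completion: the expected time from each state satisfies E[s] = Σ prob(b | s) · (time(s, b) + E[b]), with E[done] = 0. The sketch below transcribes the fitted probabilities and dwell times from the table above (as we have reconstructed it) and solves the system by fixed-point iteration; the solver itself is ours, not part of the study.

```python
# Fitted edges, transcribed from the table: state -> {next: (prob, dwell_s)}.
EDGES = {
    "program":  {"compile1": (1.0, 268)},
    "compile1": {"test": (1.0, 49)},
    "test":     {"program": (0.237, 5), "debug": (0.713, 5),
                 "run": (0.048, 9), "done": (0.002, 5)},
    "debug":    {"compile1": (1.0, 380)},
    "run":      {"debug": (0.266, 5), "optimize": (0.699, 4),
                 "done": (0.035, 3)},
    "optimize": {"compile2": (1.0, 629)},
    "compile2": {"run": (1.0, 30)},
}

def expected_times(edges, sweeps=20000):
    """Iterate E[s] = sum_b p(b|s) * (time(s,b) + E[b]) to its fixed point."""
    E = {s: 0.0 for s in edges}
    E["done"] = 0.0  # absorbing state
    for _ in range(sweeps):
        for s, outs in edges.items():
            E[s] = sum(p * (dwell + E[nxt]) for nxt, (p, dwell) in outs.items())
    return E

E = expected_times(EDGES)
print(f"expected time from 'program' to 'done': {E['program'] / 3600:.1f} hours")
```

With these numbers the expected completion time comes out at roughly 17 to 18 hours, in the same range as the per-student totals on the summary slide, which is a useful sanity check on the fit.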
14. Fitting data to Cray model
[Diagram: the fitted workflow as a timed Markov process. States: Formulate, Program, Compile, Debug, Test, Compile (second), Optimize, Run. Edges are labeled "probability / average dwell time": 1.0 / 268s, 1.0 / 380s, .237 / 5s, 1.0 / 49s, .713 / 5s, .002 / 5s, .048 / 9s, 1.0 / 629s, .266 / 5s, 1.0 / 30s, .699 / 4s, .035 / 3s.]
15. Conclusions
- Remember all the caveats!
  - This is very preliminary; no conclusions about workflows yet
- Fitting measured data to hypothesized workflows looks promising
- We're getting better at knowing what to measure and how
  - See Vic's talk yesterday
  - Formulation time, programming vs. debugging, . . . ?
  - Measure automatically when possible
- Lots more data coming soon from classroom experiments
  - Should be able to do this with professional data too
- Want to compare different languages / apps / . . .
- Want to use a principled approach to estimating statistics of dwell times, evaluating competing state models, etc.