Empirical Evaluation - PowerPoint PPT Presentation
1
Empirical Evaluation
  • Assessing usability

2
Agenda
  • Evaluation overview
  • Empirical studies
  • Process
  • Experimental design
  • Variables
  • Methods
  • Results

3
One Model
4
Evaluation
  • Earlier
    • Interpretive and predictive
    • Heuristic evaluation, walkthroughs, ethnography
  • Now
    • User involved, evaluate usage
    • Experiments, usage observations, interviews, ...

5
Users Involved
  • Interpretive (naturalistic) vs. Empirical
  • Naturalistic
    • In a realistic setting; usually includes some
      detached observation, careful study of users
  • Empirical
    • People use the system; you manipulate independent
      variables and observe dependent ones

6
Evaluation
  • Summative vs. Formative
  • What were they?

7
Evaluation Choices
  • Why done?
  • Summative
    • System already exists; measuring against some
      criteria
  • Formative
    • Inform and support iterative design
8
Evaluation Data Gathering
  • Information we gather about an interface can be
    objective or subjective
  • Information also can be qualitative or
    quantitative
  • Which are tougher to measure?

9
Empirical Usability Testing
  • Key
  • Perform experiments and observe users performing
    benchmark tasks with interface under study
  • Gather data to learn about usability,
    satisfaction, etc.
  • Use that to inform iterative redesign and
    refinement

10
Validity Concerns
  • Are typical users tested?
  • Are typical tasks used?
  • Is the physical environment typical?
  • Is the social context appropriate?

11
Process
  • Steps in formative evaluation using experiments
  • Develop the experiment
  • Direct the evaluation sessions
  • Collect the data
  • Analyze the data
  • Draw conclusions to form a resolution for each
    design problem
  • Redesign and implement the revised interface

12
Develop Experiment
  • Recruit participants
  • Use bribes: cookies, wash their car, real rewards
  • Make sure people have a good attitude
  • Fit the user population
  • 3-5 people as pilots
  • Do they carry through to the next round?
  • Maybe 1 out of 3 moves on to the next stage

13
Develop Experiment
  • Developing tasks
  • Benchmark tasks - gather quantitative data
  • Representative tasks - add breadth, can help
    understand process
  • Tell what to do, not how to do it
  • Have introductory remarks and explanation written
    down

14
Develop Experiment
  • Developing tasks (contd.)
  • Lab testing versus field testing issues
  • Informed consent form
  • Run pilot versions to shake out the bugs

15
Directing Sessions
  • Issues
  • Are you in same room or not?
  • Single person session or pairs of people
  • Objective data -- stay detached

16
Collecting Data
  • Data gathering
  • Note-taking
  • Audio and video tape
  • Instrumented user interface
  • Post-experiment questions and interviews

17
Collecting Data
  • Identifying errors can be difficult
  • Qualitative techniques
  • Think-aloud - can be very helpful
  • Post-hoc verbal protocol - review video
  • Critical incident logging - positive and negative
  • Structured interviews - good questions
  • What did you like best/least?
  • How would you change..?
  • More to come next time...

18
Data Analysis
  • Simple analysis
  • Determine the means (time, number of errors, etc.)
    and compare with goal values (coming up)
  • Determine
  • Why did the problems occur?
  • What were their causes?
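A minimal sketch of this simple analysis in Python; the completion times and the goal value below are made up for illustration:

```python
from statistics import mean

# Completion times (secs.) observed for one benchmark task (illustrative values)
times = [12, 17, 19, 15, 13, 21]
goal = 18  # hypothetical goal value from the usability specification

avg = mean(times)
print(f"mean time: {avg:.1f} secs. (goal: {goal} secs.)")
if avg > goal:
    print("goal missed: investigate why the problems occurred")
```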

19
Objective Data
Today's Focus
  • Users interact with interface
  • You observe, monitor, calculate, examine,
    measure, ...
  • Objective, scientific data gathering
  • Comparison to interpretive/predictive evaluation

20
Experiments
  • Utilize classical scientific method of
    hypothesis, experiment and analysis
  • Key is methodology

21
Benchmark Tasks
  • Specific, clearly stated task for users to carry
    out
  • Example: email handler
  • "Find the message from Mary and reply with a
    response of Tuesday morning at 11."
  • Users perform these under a variety of conditions
    and you measure result

22
Experimental Methodology
  • Variables - facets or attributes of study that
    can vary
  • We want to control all variables but the ones
    we're testing

Examples: subject experience, gender, interface 1 vs.
interface 2, lighting, intelligence, location,
color vs. b/w, etc.
23
Control
  • Two methods of achieving it
  • Don't allow it to vary
  • Make subjects/attributes as representative of the
    population as possible
  • In both mean and range
  • Often, the second method is all you can do

24
Types of Variables
  • Participants are a random variable
  • In an experiment, we have independent and
    dependent variables
  • Independent - What you're studying, what you
    intentionally vary (e.g., interface feature)
  • Dependent - What the study produces and you
    tabulate, measure or examine (e.g., time, number of
    errors)

25
Example
  • Do people complete operations faster with a
    black-and-white display or a color one?
  • Independent - color or b/w
  • Dependent - time it takes to complete

26
Experimental Designs
  • 1. Within Subjects
  • Every participant provides a score for all levels
    or conditions

        Color       B/W
P1      12 secs.    17 secs.
P2      19 secs.    15 secs.
P3      13 secs.    21 secs.
...
27
Experimental Designs
  • 2. Between Subjects
  • Each participant provides results for only one
    condition

Color           B/W
P1  12 secs.    P2  17 secs.
P7  19 secs.    P5  15 secs.
P3  13 secs.    P8  21 secs.
...
28
Which to Use?
  • What are the advantages and disadvantages of the
    two techniques?

29
Within Advantages
  • Within subjects gives you more relative
    information - each person is their own control
  • You need a bigger number of participants in a
    between-subjects design to average things out

30
Between Advantages
  • Within-subjects tests are much more liable to
    ordering effects
  • Participant may learn from the first condition
  • Fatigue may make the second performance worse
  • Remedy: half go first in one condition, half go
    first in the other

31
Experimental Results
  • How does one know if an experiment's results mean
    anything or confirm any beliefs?
  • Example: 20 people participated; 11 preferred
    interface 1, 9 preferred interface 2
  • What do you conclude?

32
Hypothesis Testing
  • In experiment, we set up a null hypothesis to
    check
  • Basically, it says that what occurred was simply
    because of chance
  • For example, any participant has an equal chance
    of preferring interface 1 over interface 2

33
Hypothesis Testing
  • If the probability that the result happened by
    chance is low, then your results are said to be
    significant
  • Statistical measures of significance levels
  • 0.05 often used
  • Less than 5% possibility it occurred by chance
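For the 20-participant preference example from slide 31, the null hypothesis can be checked with an exact sign (binomial) test; a sketch using only the Python standard library:

```python
from math import comb

def sign_test_p(successes: int, n: int) -> float:
    """Two-sided exact binomial test against the null hypothesis that
    each participant prefers either interface with probability 1/2."""
    k = max(successes, n - successes)
    one_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * one_tail)

# 11 of 20 participants preferred interface 1
p = sign_test_p(11, 20)
print(f"p = {p:.3f}")  # well above 0.05, so no significant preference
```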

34
Example
Experiment 1    Group 1        Group 2
                1, 10, 10      3, 6, 21
                Mean 7         Mean 10

Experiment 2    Group 1        Group 2
                6, 7, 8        8, 11, 11
                Mean 7         Mean 10
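The two experiments have the same mean difference but very different variability; a sketch computing the pooled two-sample t statistic (standard library only) shows why Experiment 2 is far more convincing:

```python
from math import sqrt
from statistics import mean, variance

def t_statistic(g1, g2):
    """Pooled two-sample t statistic (assumes equal variances)."""
    n1, n2 = len(g1), len(g2)
    pooled_var = ((n1 - 1) * variance(g1) + (n2 - 1) * variance(g2)) / (n1 + n2 - 2)
    return (mean(g2) - mean(g1)) / sqrt(pooled_var * (1 / n1 + 1 / n2))

t1 = t_statistic([1, 10, 10], [3, 6, 21])   # Experiment 1: high variance
t2 = t_statistic([6, 7, 8], [8, 11, 11])    # Experiment 2: low variance
# Same mean difference (3), but Experiment 2's t is much larger,
# so only it is likely to reach significance.
print(t1, t2)
```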
35
Other Methods
  • Another kind of test is contingency table measure
  • May have two variables and two conditions

            System A    System B
Men         13          17
Women       11          15
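A sketch of the chi-square statistic for this 2x2 table, using only the Python standard library (for one degree of freedom, the 0.05 critical value is 3.84):

```python
def chi_square_2x2(table):
    """Chi-square statistic for a 2x2 contingency table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Men/Women x System A/System B counts from the slide
stat = chi_square_2x2([[13, 17], [11, 15]])
print(f"chi-square = {stat:.4f}")  # far below 3.84: no significant association
```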
36
Errors
  • Errors do occur
  • Types
  • Type I/False positive - You conclude there is a
    difference when in fact there isn't
  • Type II/False negative - You conclude there is no
    difference when there is
  • Type III

37
Presentation Techniques
[Figure: box plot of time in secs. (axis 0 to 20) by age,
showing the mean, low and high values, and the middle 50%]
38
Upcoming
  • Gathering data
  • Observing users
  • Subjective data, querying users
  • Usability Specifications
  • CHI videos