Empirical Evaluations of Organizational Memory Information Systems
Felix-Robinson Aschoff, Ludger van Elst
Empirical Evaluations of OMIS
1. Evaluation: Definition and general approaches
2. Contributions from related fields
3. Implications for FRODO
What is Empirical Evaluation?
"Empirical evaluation refers to the appraisal of a theory by observation in experiments." (Chin, 2001)
Experiment or not?

Experiment
- Advantages: influencing variables can be controlled; causal statements can be inferred
- Problems: artificial setting; transfer to the normal user context; requires concrete hypotheses; subjects must be found, recruited and paid

Less controlled exploratory study
- Advantages: more realistic (higher external validity); can be easier and faster to design
- Problems: influencing variables cannot be controlled; requires cooperation with people during their everyday work
Artificial Intelligence vs. Intelligence Amplification

AI (expert systems)
- Development goal: mind-imitating; a user-independent system working by itself
- Evaluation: focus on technical evaluation, i.e. whether the system meets its requirements

IA (OMIS, FRODO)
- Development goal: hybrid solution; cooperation between system and human user; constant interaction
- Evaluation: focus must be on the cooperation of system and user; human-in-the-loop studies
Empirical Evaluations of OMIS
1. Evaluation: Definition and general approaches
2. Contributions from related fields
3. Implications for FRODO
Contributions from related fields
1. Knowledge Engineering
   1.1 General approaches: the Sisyphus Initiative, High Performance Knowledge Bases, Essential Theory Approach, Critical Success Metrics
   1.2 Knowledge Acquisition
   1.3 Ontologies
2. Human Computer Interaction
3. Information Retrieval
4. Software Engineering (Goal-Question-Metric Technique)
The Sisyphus Initiative
A series of challenge problems for the development of KBS by different research groups, with a focus on PSMs.
- Sisyphus-I: Room allocation
- Sisyphus-II: Elevator configuration
- Sisyphus-III: Lunar igneous rock classification
- Sisyphus-IV: Integration over the web
- Sisyphus-V: High quality knowledge base initiative (hQkb) (Menzies, 1999b)
Problems of the Sisyphus Initiative
Sisyphus I and II:
- No higher referees
- No common metrics
- Focus on modelling of knowledge; the effort to build a model of the domain knowledge was usually not recorded
- Important aspects like the accumulation of knowledge and cost-effectiveness calculations were not paid any attention
Sisyphus III:
- Funding
- Willingness of researchers to participate

"...none of the Sisyphus experiments have yielded much evaluation information (though at the time of this writing Sisyphus-III is not complete)" (Shadbolt et al., 1999)
High Performance Knowledge Bases
- Run by the Defense Advanced Research Projects Agency (DARPA) in the USA
- Goal: to increase the rate at which knowledge can be modified in a KBS
- Three groups of researchers:
  1) challenge problem developers
  2) technology developers
  3) integration teams
HPKB Challenge Problem
- International crisis scenario in the Persian Gulf:
  - Hostilities between Saudi Arabia and Iran
  - Iran closes the Strait of Hormuz to international shipping
- Integration of the following KBs:
  - the HPKB upper-level ontology (Cycorp)
  - the World Fact Book knowledge base (Central Intelligence Agency)
  - the Units and Measures Ontology (Stanford)
- Example questions the system should be able to answer:
  - With what weapons is Iran capable of firing upon tankers in the Strait of Hormuz?
  - What risk would Iran face in closing the strait to shipping?
- The answer key to the second question contains, for example: economic sanctions from Saudi Arabia, GCC, U.S., UN, because Iran violates an international norm promoting freedom of the seas. (Source: The Convention on the Law of the Sea)
HPKB Evaluation
- Systems' answers were rated on four official criteria by challenge problem developers and subject matter experts (scale 0-3):
  - the correctness of the answer
  - the quality of the explanation of the answer
  - the completeness and quality of the cited sources
  - the quality of the representation of the question
- Two-phase, test-retest schedule
Essential Theory Approach (Menzies & van Harmelen, 1999)
Different schools of knowledge engineering
Technical evaluation of ontologies (Gómez-Pérez, 1999)
Criteria:
1) Consistency
2) Completeness
3) Conciseness
4) Expandability
5) Sensitiveness
Errors in developing taxonomies:
- Circularity errors (a detection sketch follows below)
- Partition errors
- Redundancy errors
- Grammatical errors
- Semantic errors
- Incompleteness errors
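Of these error classes, circularity is the easiest to check mechanically. A minimal, hypothetical Python sketch (not from Gómez-Pérez or any FRODO tool) that flags classes occurring among their own transitive superclasses:

```python
# Minimal sketch: detecting circularity errors (Gómez-Pérez, 1999) in a
# taxonomy given as {class: [direct superclasses]}. All names are
# illustrative, not part of FRODO or any specific tool.

def find_circularities(taxonomy):
    """Return classes that (transitively) occur among their own superclasses."""
    def reachable(cls, seen):
        for sup in taxonomy.get(cls, []):
            if sup in seen:
                continue
            seen.add(sup)
            reachable(sup, seen)
        return seen

    return [c for c in taxonomy if c in reachable(c, set())]

taxonomy = {
    "Dog": ["Mammal"],
    "Mammal": ["Animal"],
    "Animal": ["Dog"],   # circularity: Dog -> Mammal -> Animal -> Dog
    "Fish": ["Animal"],
}
print(find_circularities(taxonomy))  # ['Dog', 'Mammal', 'Animal']
```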
Related Fields
- Knowledge Acquisition
  - Shadbolt, N., O'Hara, K. & Crow, L. (1999). The experimental evaluation of knowledge acquisition techniques and methods: history, problems and new directions. International Journal of Human-Computer Studies, 51, 729-755.
- Human Computer Interaction
  - "HCI is the study of how people design, implement, and use interactive computer systems, and how computers affect individuals and society." (Myers et al., 1996)
  - facilitate interaction between users and computer systems
  - make computers useful to a wider population
- Information Retrieval
  - Recall and Precision (sketched below)
  - e.g. keyword-based IR vs. ontology-enhanced IR (Aitken & Reid, 2000)
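Recall and precision are simple set ratios; a minimal Python sketch (function name and document IDs are illustrative, no specific corpus or tool is implied):

```python
# Minimal sketch of the standard IR metrics named on the slide.

def precision_recall(retrieved, relevant):
    """Precision = |retrieved & relevant| / |retrieved|;
    recall = |retrieved & relevant| / |relevant|."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# e.g. comparing keyword-based vs. ontology-enhanced retrieval runs
p, r = precision_recall(retrieved=["d1", "d3", "d7"], relevant=["d1", "d2", "d3"])
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.67 recall=0.67
```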
Empirical Evaluations of OMIS
1. Evaluation: Definition and general approaches
2. Contributions from related fields
3. Implications for FRODO
Guideline for Evaluation
- Formulate the main purposes of your framework or application.
- Formulate precise hypotheses.
- Define clear performance metrics.
- Standardize the measurement of your performance metrics.
- Be thorough when creating your (experimental) research design.
- Consider the use of inference statistics. (Cohen, 1995; see the sketch below)
- Meet common standards for the report of your results.
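As one concrete reading of the "inference statistics" item, a two-sample t-test comparing task-completion times of two randomized groups is a typical choice. The sketch below uses invented numbers and assumes SciPy is available:

```python
# Minimal sketch of the "inference statistics" step (Cohen, 1995):
# comparing task-completion times of two randomized groups with a
# two-sample t-test. The data are invented for illustration.
from scipy import stats

group_a = [12.1, 10.4, 11.8, 13.0, 12.5, 11.1]  # e.g. with tool support
group_b = [14.2, 13.8, 15.1, 14.9, 13.5, 15.6]  # e.g. without tool support

t, p = stats.ttest_ind(group_a, group_b)
print(f"t={t:.2f}, p={p:.4f}")  # reject H0 at alpha=0.05 if p < 0.05
```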
Evaluation of Frameworks
Frameworks are general in scope and designed to cover a wide range of tasks and problems. The systematic control of influencing variables becomes very difficult.

"Only a whole series of experiments across a number of different tasks and a number of different domains could control for all the factors that would be essential to take into account." (Shadbolt et al., 1999)

Approaches:
- Sisyphus Initiative
- Essential Theory Approach (Menzies & van Harmelen, 1999)
Problems with the Evaluation of FRODO
- Difficulty of controlling influencing variables when evaluating entire frameworks
- FRODO is not a running system (yet)
- Only a few prototypical implementations are based on FRODO
- FRODO is probably underspecified for evaluation in many areas
Goal-Question-Metric Technique (Basili, Caldiera & Rombach, 1994)
Informal FRODO Project Goals
- FRODO will provide a flexible, scalable framework for evolutionary growth of distributed OMs
- FRODO will provide a comprehensive toolkit for the automatic or semi-automatic construction and maintenance of domain ontologies
- FRODO will improve information delivery by the OM by developing more integrated and more easily adaptable DAU techniques
- FRODO will develop a methodology and tools for business-process oriented knowledge management relying on the notion of weakly-structured workflows
- FRODO is based on the assumption that a hybrid solution, where the system supports humans in the decision-making process, is more appropriate for OMIS than mind-imitating AI systems (IA > AI)
Task Type and Workflows

Task types form a spectrum: negotiation - co-decision making - projects - workflow processes
- Negotiation end: unique, low volume, communication-intensive
- Workflow-process end: repetitive, high volume, heads-down
FRODO GQM Goal concerning workflows

GQM goal of FRODO (a data-structure sketch follows after this list):
- Purpose: compare
- Quality issue: efficiency
- Object (process): task completion with workflows
- Viewpoint: the end-user's viewpoint
- Context: knowledge-intensive tasks

Conceptual level (goals)
- GQM goals should specify:
  - a Purpose
  - a quality Issue
  - a measurement Object
  - a Viewpoint
- The object of measurement can be:
  - Products
  - Processes
  - Resources
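The four-part goal template can be captured in a small data structure; the class below is a hypothetical illustration (not FRODO code), instantiated with the workflow goal from this slide:

```python
# Minimal sketch of the GQM goal template (Basili, Caldiera & Rombach,
# 1994) as a data structure; field names follow the slide, the class
# itself is an illustrative assumption.
from dataclasses import dataclass

@dataclass
class GQMGoal:
    purpose: str        # e.g. "compare", "monitor"
    quality_issue: str  # e.g. "efficiency"
    object: str         # product, process, or resource under study
    viewpoint: str      # whose perspective the measurement takes
    context: str = ""   # optional environment of the measurement

workflow_goal = GQMGoal(
    purpose="compare",
    quality_issue="efficiency",
    object="task completion with workflows",
    viewpoint="end-user",
    context="knowledge-intensive tasks",
)
print(workflow_goal)
```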
GQM Abstraction Sheet for FRODO

Quality factors: efficiency of task completion
Variation factors: task types as described in Abecker (2001) (dimensions: negotiation, co-decision making, projects, workflow processes)
Baseline hypothesis: the experimental design will provide a control group for comparison
Impact of variation factors: KiTs are more successfully supported by weakly-structured, flexible workflows based on FRODO than by a-priori strictly-structured workflows
GQM Questions and Metrics
- What is the efficiency of task completion using FRODO weakly-structured flexible workflows for KiTs?
- What is the efficiency of task completion using a-priori strictly-structured workflows for KiTs?
- What is the efficiency of task completion using FRODO weakly-structured flexible workflows for classical workflow processes?

Metrics:
- Efficiency of task completion: quality of result (expert judgement) divided by the time needed for completion of the task (see the sketch below)
- User-friendliness as judged by users
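The efficiency metric defined above reduces to a single division; a minimal sketch, assuming a 0-10 expert quality scale and time measured in minutes (both assumptions, not fixed by the slide):

```python
# Minimal sketch of the efficiency metric defined on the slide:
# expert-judged result quality divided by completion time.

def task_efficiency(quality_rating: float, completion_minutes: float) -> float:
    """Efficiency = quality of result (expert judgement) / time needed."""
    if completion_minutes <= 0:
        raise ValueError("completion time must be positive")
    return quality_rating / completion_minutes

# e.g. quality rated 0..10 by an expert, time in minutes
print(task_efficiency(quality_rating=8.0, completion_minutes=40.0))  # 0.2
```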
Hypotheses
H1: For KiTs, weakly-structured flexible workflows as proposed by FRODO will yield higher efficiency of task completion than a-priori strictly-structured workflows.
H2: For classical workflow processes, FRODO weakly-structured flexible workflows will be as good as a-priori strictly-structured workflows, or better.
Experimental Design
2 x 2 factorial experiment
- Independent variables: workflow type, task type
- Dependent variable: efficiency of task completion

Conditions:
- weakly-structured flexible workflow / KiT
- strictly-structured workflow / KiT
- weakly-structured flexible workflow / classical workflow process
- strictly-structured workflow / classical workflow process

Within-subject design vs. between-subject design
Randomized groups (15-20 subjects for statistical inference; an assignment sketch follows below)
Possibilities: degradation studies, benchmarking
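For a between-subject design, subjects are randomly assigned to the four cells. A minimal sketch; cell names follow the slide and group sizes match the 15-20 suggested above:

```python
# Minimal sketch of randomized assignment to the four cells of the
# 2 x 2 design (workflow type x task type).
import random

conditions = [
    ("weakly-structured wf", "KiT"),
    ("strictly-structured wf", "KiT"),
    ("weakly-structured wf", "classical wf process"),
    ("strictly-structured wf", "classical wf process"),
]

def randomize(subjects, conditions, seed=42):
    """Shuffle subjects and deal them round-robin into the design cells
    (between-subject design: each subject sees exactly one condition)."""
    rng = random.Random(seed)
    shuffled = subjects[:]
    rng.shuffle(shuffled)
    groups = {c: [] for c in conditions}
    for i, s in enumerate(shuffled):
        groups[conditions[i % len(conditions)]].append(s)
    return groups

subjects = [f"S{i:02d}" for i in range(1, 61)]  # 60 subjects -> 15 per cell
for cond, members in randomize(subjects, conditions).items():
    print(cond, len(members))
```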
Empirical Evaluation of Organizational Memory Information Systems
Felix-Robinson Aschoff, Ludger van Elst

1 Introduction
2 Contributions from Related Fields
  2.1 Knowledge Engineering
    2.1.1 General Methods and Guidelines (Essential Theories, Critical Success Metrics, Sisyphus, HPKB)
    2.1.2 Knowledge Acquisition
    2.1.3 Ontologies
  2.2 Human Computer Interaction
  2.3 Information Retrieval
  2.4 Software Engineering (Goal-Question-Metric Technique)
3 Implications for Organizational Memory Information Systems
  3.1 Implications for the evaluation of OMIS
  3.2 Relevant aspects of OMs for evaluations and rules of thumb for conducting evaluative research
  3.3 Preliminary sketch of an evaluation of FRODO
References
Appendix A: Technical evaluation of Ontologies
References

Aitken, S. & Reid, S. (2000). Evaluation of an ontology-based information retrieval tool. Proceedings of the 14th European Conference on Artificial Intelligence. http://delicias.dia.fi.upm.es/WORKSHOP/ECAI00/accepted-papers.html

Basili, V.R., Caldiera, G. & Rombach, H.D. (1994). Goal question metric paradigm. In J.J. Marciniak (Ed.), Encyclopedia of Software Engineering, Vol. 1, 528-532. John Wiley & Sons.

Berger, B., Burton, A.M., Christiansen, T., Corbridge, C., Reichelt, H. & Shadbolt, N.R. (1989). Evaluation criteria for knowledge acquisition. ACKnowledge project deliverable ACK-UoN-T4.1-DL-001B. University of Nottingham, Nottingham.

Chin, D.N. (2001). Empirical evaluation of user models and user-adapted systems. User Modeling and User-Adapted Interaction, 11, 181-194.

Cohen, P. (1995). Empirical Methods for Artificial Intelligence. Cambridge: MIT Press.

Cohen, P.R., Schrag, R., Jones, E., Pease, A., Lin, A., Starr, B., Easter, D., Gunning, D. & Burke, M. (1998). The DARPA high performance knowledge bases project. AI Magazine, 19(4), 25-49.

Gómez-Pérez, A. (1999). Evaluation of taxonomic knowledge in ontologies and knowledge bases. Proceedings of KAW'99. http://sern.ucalgary.ca/KSI/KAW/KAW99/papers.html

Grüninger, M. & Fox, M.S. (1995). Methodology for the design and evaluation of ontologies. Workshop on Basic Ontological Issues in Knowledge Sharing, IJCAI-95, Montreal.

Hays, W.L. (1994). Statistics. Orlando: Harcourt Brace.

Kagolovsky, Y. & Moehr, J.R. (2000). Evaluation of information retrieval: old problems and new perspectives. Proceedings of the 8th International Congress on Medical Librarianship. http://www.icml.org/tuesday/ir/kagalovosy.htm

Martin, D.W. (1995). Doing Psychological Experiments. Pacific Grove: Brooks/Cole.

Menzies, T. (1999a). Critical success metrics: evaluation at the business level. International Journal of Human-Computer Studies, 51, 783-799.

Menzies, T. (1999b). hQkb - the high quality knowledge base initiative (Sisyphus V: learning design assessment knowledge). Proceedings of KAW'99. http://sern.ucalgary.ca/KSI/KAW/KAW99/papers.html

Menzies, T. & van Harmelen, F. (1999). Editorial: evaluating knowledge engineering techniques. International Journal of Human-Computer Studies, 51, 715-727.

Myers, B., Hollan, J. & Cruz, I. (Eds.) (1996). Strategic directions in human computer interaction. ACM Computing Surveys, 28(4).

Nick, M., Althoff, K. & Tautz, C. (1999). Facilitating the practical evaluation of knowledge-based systems and organizational memories using the goal-question-metric technique. Proceedings of KAW'99. http://sern.ucalgary.ca/KSI/KAW/KAW99/papers.html

Shadbolt, N., O'Hara, K. & Crow, L. (1999). The experimental evaluation of knowledge acquisition techniques and methods: history, problems and new directions. International Journal of Human-Computer Studies, 51, 729-755.

Tallis, M., Kim, J. & Gil, Y. (1999). User studies of knowledge acquisition tools: methodology and lessons learned. Proceedings of KAW'99. http://sern.ucalgary.ca/KSI/KAW/KAW99/papers.html

Tennison, J., O'Hara, K. & Shadbolt, N. (1999). Evaluating KA tools: lessons from an experimental evaluation of APECKS. Proceedings of KAW'99. http://sern.ucalgary.ca/KSI/KAW/KAW99/papers/Tennison1/
Tasks for Workflow Evaluation
Possible tasks for a workflow evaluation experiment:

KiT: Please write a report about your personal greatest learning achievements during the last semester. Find sources related to these scientific areas on the Internet. Prepare a PowerPoint presentation. To help you with this task you will be provided with a FRODO weakly-structured workflow / a classical workflow.

Simple structured task: Please install Netscape on your computer and use the Internet to find all universities in Iowa that offer computer science. Use e-mail to ask for further information. To help you with this task you will be provided with a FRODO weakly-structured workflow / a classical workflow.
GQM Goals for CBR-PEB (Nick, Althoff & Tautz, 1999)

GOAL 2: Economic utility
- Analyze: retrieved information
- For the purpose of: monitoring
- With respect to: economic utility
- From the viewpoint of: CBR system developers
- In the context of: decision support for CBR development

Conceptual level (goals)
- GQM goals should specify:
  - a Purpose
  - a quality Issue
  - a measurement Object
  - a Viewpoint
- The object of measurement can be:
  - Products
  - Processes
  - Resources
GQM Abstraction Sheet for CBR-PEB
Goal 2: Economic utility for CBR-PEB

Quality factors:
1. Similarity of retrieved information as modeled in CBR-PEB (Q-12)
2. Degree of maturity (desired: max.): development, prototype, pilot use (Q-13) ...

Variation factors:
1. Amount of background knowledge, e.g. number of attributes (Q-8.1.1) ...
2. Case origin: university, industrial research, industry ...

Baseline hypothesis: 1. M.M.0.2, N.N.0.5 (scale 0..1) ... (the estimates are averages)

Impact of variation factors:
1. The higher the amount of background knowledge, the higher the similarity. (Q-8)
2. The more industrial the case origin, the higher the degree of maturity. (Q-9) ...
GQM Questions and Metrics
GQM plan for CBR-PEB, Goal 2: Economic utility

Q-9: What is the impact of the case origin on the degree of maturity? (a cross-tabulation sketch follows below)
  Q-9.1: What is the case origin?
    M-9.1.1: per retrieval attempt, for each chosen case: case origin (university, industrial research, industry)
  Q-9.2: What is the degree of maturity of the system?
    M-9.2.1: per retrieval attempt, for each chosen case: case attribute "status" (prototype, being developed, pilot system, application in practical use, unknown)
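Q-9 can be answered descriptively by cross-tabulating the two metrics M-9.1.1 and M-9.2.1 over retrieval attempts; the records below are invented for illustration:

```python
# Minimal sketch of evaluating Q-9 (impact of case origin on degree of
# maturity) by cross-tabulating the two metrics. Data are invented.
from collections import Counter

records = [  # (case origin, maturity status) per chosen case
    ("industry", "application in practical use"),
    ("industry", "pilot system"),
    ("industrial research", "prototype"),
    ("university", "prototype"),
    ("university", "being developed"),
]

crosstab = Counter(records)
for (origin, status), n in sorted(crosstab.items()):
    print(f"{origin:20s} {status:30s} {n}")
```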
FRODO GQM Goal concerning Ontologies
For the circumstances FRODO is designed for, hybrid solutions are more successful than AI solutions.
- Purpose: compare
- Issue: the efficiency of
- Object (process): ontology construction and use
- With respect to: stability, sharing scope, formality of information
- Viewpoint: from the users' viewpoint
GQM Abstraction Sheet for FRODO (Ontologies)

Quality factors: efficiency of ontology construction and use
Variation factors: sharing scope, stability, formality
Baseline hypothesis: the experimental design will provide a control group for comparison
Impact of variation factors:
- high sharing scope, medium stability, low formality -> FRODO more successful
- low sharing scope, high stability, high formality -> AI more successful
GQM Questions and Metrics
- What is the efficiency of the ontology construction and use process using FRODO for a situation with high sharing scope, medium stability and low formality?
- What is the efficiency of the ontology construction and use process using FRODO for a situation with low sharing scope, high stability and high formality?
- What is the efficiency of the ontology construction and use process using AI systems for these situations?

Metrics:
- Efficiency of ontology construction: number of definitions / time (sketched below)
- Efficiency of ontology use: information retrieval (recall and precision)
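Both metrics are straightforward to compute; the sketch below covers construction efficiency, while ontology use would reuse the precision/recall sketch from the Information Retrieval slide. The time unit and numbers are assumptions:

```python
# Minimal sketch of the construction metric named on the slide:
# ontology-construction efficiency = number of definitions / time.

def construction_efficiency(num_definitions: int, hours: float) -> float:
    """Definitions produced per hour of knowledge-engineering work."""
    if hours <= 0:
        raise ValueError("time must be positive")
    return num_definitions / hours

# e.g. 120 concept/relation definitions produced in 16 hours of work
print(construction_efficiency(120, 16.0))  # 7.5 definitions per hour
```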
Hypotheses
H1: For situation 1 (high sharing scope, medium stability, low formality) FRODO will yield a higher efficiency of ontology construction and use.
H2: For situation 2 (low sharing scope, high stability, high formality) an AI system will yield a higher efficiency of ontology construction and use.
Experimental Design
2 x 2 factorial experiment
- Independent variables: situation (1/2), system (FRODO/AI)
- Dependent variable: efficiency of ontology construction and use

Conditions:
- situation 1 / FRODO
- situation 2 / FRODO
- situation 1 / AI system
- situation 2 / AI system

Within-subject design vs. between-subject design
Randomized groups (15-20 subjects for statistical inference; an analysis sketch follows below)
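Data from such a 2 x 2 design are conventionally analysed with a two-way ANOVA (main effects of system and situation plus their interaction). A minimal sketch, assuming pandas/statsmodels are available and using invented efficiency scores:

```python
# Minimal sketch of a two-way ANOVA over the 2 x 2 design. Data are
# invented for illustration; real cells would hold 15-20 subjects each.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.DataFrame({
    "system":     ["FRODO"] * 4 + ["AI"] * 4,
    "situation":  ["s1", "s1", "s2", "s2"] * 2,
    "efficiency": [7.5, 8.1, 5.2, 5.6, 4.9, 5.3, 7.8, 8.4],
})

model = ols("efficiency ~ C(system) * C(situation)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))  # F and p per effect and interaction
```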
Big evaluation versus small evaluation (van Harmelen, 1998)
- Distinguish different types of evaluation:
  - Big evaluation: evaluation of KA/KE methodologies
  - Small evaluation: evaluation of KA/KE components (e.g. a particular PSM)
  - Micro evaluation: evaluation of a KA/KE product (e.g. a single system)
- Some are more interesting than others:
  - Big evaluation is impossible to control
  - Micro evaluation is impossible to generalize
  - Small evaluation might just be the only option
Knowledge Acquisition
Problems with the evaluation of the KA process (Shadbolt et al., 1999):
1) the availability of human experts
2) the need for a gold standard of knowledge
3) the question of how many different domains and tasks should be included
4) the difficulty of isolating the value added by a single technique or tool
5) how to quantify knowledge and knowledge engineering effort
Knowledge Acquisition
4) The difficulty of isolating the value added by a single technique or tool:
- Conduct a series of experiments
- Test different implementations of the same technique against each other, or against a paper-and-pencil version
- Test groups of tools in complementary pairings, or different orderings of the same set of tools
- Test the value of single sessions against multiple sessions, and the effect of feedback in multiple sessions
- Exploit techniques from the evaluation of standard software to control for effects from interface, implementation etc.
- Problem: scale-up of the experimental programme
Essential Theory Approach
- Identify a process of interest.
- Create an essential theory T for that process.
- Identify some competing process description, ¬T.
- Design a study that explores core pathways in both ¬T and T.
- Acknowledge that your study may not be definitive.

Advantage: broad conceptual approach; results are of interest for the entire community.
Problem: interpretation of results is difficult (due to the KE school, or due to concrete technology such as implementation, interface etc.?)
Three Aspects of Ontology Evaluation
- the process of constructing the ontology
- the technical evaluation
- end-user assessment and ontology-user interaction
Assessment and Ontology-User Interaction
"Assessment is focused on judging the understanding, usability, usefulness, abstraction, quality and portability of the definitions from the users' point of view." (Gómez-Pérez, 1999)

Ontology-user interaction in OMIS:
- more dynamic
- the success of an OMIS relies on active use
- users with heterogeneous skills, backgrounds and tasks